This homework focuses on training and evaluating prediction models for a particular problem and dataset. The data comes from the Centers for Disease Control and Prevention (CDC: https://covid.cdc.gov/covid-data-tracker/). CDC is a USA health protection agency and is in charge of collecting data about the COVID-19 pandemic, and in particular, tracking cases, deaths, and trends of COVID-19 in the United States. CDC collects and makes public deidentified individual-case data on a daily basis, submitted using standardized case reporting forms. In this analysis, we focus on using the data collected by CDC to build a data analytics solution for death risk prediction.
The dataset we work with is a sample of the public data released by CDC, where the outcome for the target feature death_yn is known (i.e., either 'yes' or 'no'): https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
The goal in this homework is to work with the data to build and evaluate prediction models that capture the relationship between the descriptive features and the target feature death_yn. For this homework you are asked to use the same dataset allocated to you in Homework1 (you can use your cleaned/prepared CSV from Homework1 or start from the raw dataset, clean it according to concepts covered in the lectures/labs, then use it for training prediction models).
There are 5 parts for this homework. Each part has an indicative maximum percentage given in brackets, e.g., part (1) has a maximum of 25% shown as [25]. The total that can be achieved is 100.
(1). [25] Data Understanding and Preparation: Exploring relationships between feature pairs and selecting/transforming promising features based on a given training set.
- (1.1) Randomly shuffle the rows of your dataset and split the dataset into two datasets: 70% training and 30% test. Keep the test set aside. For shuffling, please remember to set the random state so the split is always the same; this helps with reproducing and verifying your results.
- (1.2) On the training set:
- Plot the correlations between all the continuous features (if any). Discuss what you observe in these plots.
- For each continuous feature, plot its interaction with the target feature (a plot for each pair of continuous feature and target feature). Discuss what you observe from these plots, e.g., which continuous features seem to be better at predicting the target feature? Choose a subset of continuous features you find promising (if any). Justify your choices.
- For each categorical feature, plot its pairwise interaction with the target feature. Discuss what knowledge you gain from these plots, e.g., which categorical features seem to be better at predicting the target feature? Choose a subset of categorical features you find promising (if any). Justify your choices.
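The shuffle-and-split step in (1.1) can be sketched as follows. This is a minimal illustration on a hypothetical toy frame standing in for the cleaned Homework1 CSV; the column names and values are placeholders, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned Homework1 dataset (hypothetical toy rows).
df = pd.DataFrame({"age_group": ["20-29 Years"] * 5 + ["80+ Years"] * 5,
                   "death_yn": ["No"] * 8 + ["Yes"] * 2})

# shuffle=True is the default; fixing random_state makes the 70/30
# split reproducible across runs.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
print(len(train_df), len(test_df))  # 7 3
```

The test set is then set aside and only touched again for the hold-out evaluations in parts (2)-(5).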
(2). [15] Predictive Modeling: Linear Regression.
- (2.1) On the training set, train a linear regression model to predict the target feature, using only the descriptive features selected in exercise (1) above.
- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).
- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
(3). [15] Predictive Modeling: Logistic Regression.
- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.
- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).
- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
(4). [20] Predictive Modeling: Random Forest.
- (4.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.
- (4.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard of the working of this model.
- (4.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (4.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.
(5). [25] Improving Predictive Models.
- (5.1) Which model of the ones trained above performs better at predicting the target feature? Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.
- (5.2) Summarise your understanding of the problem and of your predictive modeling results so far. Can you think of any new ideas to improve the best model so far (e.g., by using further data prep such as: feature selection, feature re-scaling, creating new features, combining models, or using other knowledge)? Please show how your ideas actually work in practice, by training and evaluating your proposed models. Summarise your findings so far.
- (5.3) Take your best model trained and selected based on past data (i.e., your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings.
Author: ARyan - 14395076
Module: COMP47350
DC: 2021-02-08
DLM: 2021-04-28
Desc: This file builds upon my analysis of the COVID19 data set and produces models to predict death.
Dict: The Data Dictionary for the Data Set is available at: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf
Homework 1:
Introduction
Data Quality Report
Data Quality Plan
Extension Commentary and Analysis
Homework 2:
Exploratory Analysis
Model Creation and Analysis
Model Extension and Refinement
COVID-19 is an infectious disease caused by SARS-CoV-2, a coronavirus strain first identified in December 2019 following an outbreak in the Chinese city of Wuhan, with the WHO declaring the outbreak a global pandemic in March 2020.
Since its discovery, health organisations have been actively gathering data to assess aspects of the disease including infectivity, symptoms, and mortality rate. Active interest has been paid to factors which may increase a patient's risk of serious symptoms or death.
In this analysis, we focus on using the data collected by the CDC to build an analytics solution for predicting a patient's risk of death. The CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. It also includes information on whether the individual survived.
We wish to develop a model to predict the risk of a patient dying based on various metrics collected by the CDC.
The full data dictionary provided by the CDC is available at the following location: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf
For this assignment, a sample of ten thousand rows is provided from the full dataset available from: https://covid.cdc.gov/covid-data-tracker/
The assignment was broadly approached according to the outline above, though these were not hard boundaries between stages.
As requested in the exercise, the key findings are prepared within the Notebook File and accompanying PDFs.
This report will outline the initial findings based on the provided sample of the CDC dataset. It will summarise the data, describe the various data quality issues observed, and explain how they will be addressed.
The appendix includes terminology, assumptions, explanations, and a summary of changes made to the original dataset, along with feature summaries and the box plots used to visualise the data.
The following are the key points in relation to the data set and approach:
The dataset lacks a primary key.
The dataset lacks a patient identifier so we cannot look for readmitted patients.
The dataset consists of 10,000 rows and 12 (non-repeated) columns.
While null values are largely absent from the dataset, there are high proportions of values flagged as 'missing' and 'unknown', with some features containing both. The distinction between missing and unknown should be confirmed with a source knowledgeable about the dataset; however, the author's initial recommendation is that these features are likely targets for imputation, mapping both flags to a single 'unknown' value.
The datetime columns are most heavily affected by null or missing values. The author notes that the CDC data dictionary highlights the deprecation of the cdc_report_dt column and points to the usage of cdc_case_earliest_dt in its place. Following this, the author recommends removing the now-deprecated cdc_report_dt column.
The volume of duplicate rows is low at 431 rows (4.3%). Investigation into the cause highlights that sparse or common data population is the primary driver of duplicates (e.g. racial info is missing in 90% of duplicate instances, with ICU and medical condition info missing in over 95% of duplicate instances). Although these instances are likely 'valid', the recommendation is to drop them, as the high prevalence of missing information is unlikely to contribute useful information to our model.
The categorical features are good targets for conversion to a 'category' datatype, with a limited set of valid values prevalent across all category features.
There is one record where an ICU admission is flagged but not a hospital admission. This record should be removed due to the inconsistency in the data and its low impact on the overall set.
The current_status column contains 93% laboratory-confirmed cases. It should be identified with a domain expert whether the probable cases must be considered. If the probable cases can be dropped, the recommendation is to remove them and then remove this feature; however, it will be included in further components of the analysis.
As the dataset has a heavy focus on categorical data, the following tests were carried out to assess the integrity of the dataset:
T1: Check if there are cdc_case_earliest_dt values which are not the earliest of the other dates.
* Result: 2857 (29%) records which are not the earliest.
* Result: 515 (5%) records which are not the earliest where not all of the other dates are populated.
* Query: Where does this data come from?
T2: Check if there are ICU admissions without hospital admissions.
* Result: 1 record which should be updated.
T3: Check if there are probable cases with a confirmed positive specimen.
* Result: 227 records which should be updated to laboratory confirmed.
* Result: 248 records when hospital admission is also true.
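The integrity checks T1 and T2 above amount to simple boolean filters. A sketch on hypothetical toy records (the column names follow the CDC data dictionary; the values are made up):

```python
import pandas as pd

# Toy records standing in for the CDC sample (hypothetical values).
df = pd.DataFrame({
    "cdc_case_earliest_dt": pd.to_datetime(["2020-05-10", "2020-06-01"]),
    "pos_spec_dt": pd.to_datetime(["2020-05-08", "2020-06-03"]),
    "hosp_yn": ["No", "No"],
    "icu_yn": ["Yes", "No"],
})

# T1: rows where cdc_case_earliest_dt is NOT the earliest known date.
t1 = df[df["pos_spec_dt"] < df["cdc_case_earliest_dt"]]

# T2: ICU admissions without a hospital admission.
t2 = df[(df["icu_yn"] == "Yes") & (df["hosp_yn"] != "Yes")]
print(len(t1), len(t2))  # 1 1
```

On the full sample, the same filters (extended over onset_dt as well for T1) produce the record counts reported above.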
There are 8 categorical (non-datetime) features in the dataset:
F1: current_status - A feature to flag if the case is confirmed via lab or suspected.
* Null: Not applicable.
* Top Value: Laboratory Confirmed Case - 93% of rows.
* Unique Values: 2
* Overall data is reasonable. Actionable item to update probable cases where there is a positive lab specimen.
F2: sex - A feature to flag the patient's sex.
* Null: Not applicable.
* Top Value: Female 53%
* Unique Values: 4
* Should be updated to combine unknown values
F3: age_group - A feature to flag the patient's age group.
* Null: Not applicable.
* Top Value: 20-29 Years 18%
* Unique Values: 10
* 14 records have an unknown age grouping.
F4: race_ethnicity_combined - A feature to flag the patient's ethnicity.
* Null: Not applicable.
* Top Value: Unknown 41%
* Unique Values: 10
* 41% unknown values.
* Concatenated field with comma-separated values; the separation denotes Hispanic or not. This info is already captured via the racial component.
F5: hosp_yn - A feature to flag if the patient was hospitalised.
* Null: Not applicable.
* Top Value: No 52%
* Unique Values: 5
* Missing and unknown are two separate values. OTH present in one record.
F6: icu_yn - A feature to flag if the patient was admitted to ICU
* Null: Not applicable.
* Top Value: Missing 77%
* Unique Values: 4
* Check with a domain expert on the reason for the missing percentage. Are missing values indicative that the patient never ended up in the ICU and hence was not flagged? Initial investigation suggests that 'Missing' corresponds with 'No'. Note in particular that younger patients are more heavily represented, as a percentage of their age group, in the missing category, while older patients are more likely to appear in the 'No' category (something that would otherwise appear contradictory). My initial recommendation is to populate missing values with 'No'; however, I would leave this as the final actionable step so the ML model can easily be tested with and without this imputation to decide on a sensible approach. I suspect older patients are flagged explicitly as non-ICU patients because there is more concern over ICU admission being needed, resulting in an almost-skewing of the value.
F7: medcond_yn - A feature to flag if the patient had comorbidities.
* Null: Not applicable.
* Top Value: Missing 75%
* Unique Values: 4
* 82% unknown and missing values.
Target Feature: death_yn - A feature to flag if the patient died.
* Populated. 3% are yes.
There are 4 datetime features in the dataset:
D1: cdc_case_earliest_dt - The earliest of the case's recorded dates; per the CDC data dictionary, this replaces the deprecated cdc_report_dt.
* Null: Not applicable.
* Unique Values: 325
* Covers 2nd January 2020 to 16th January 2021 (missing days present).
D2: cdc_report_dt - A deprecated column. The CDC recommendation is to drop it in favour of D1.
* Should be dropped due to deprecation.
D3: pos_spec_dt - First positive specimen collected
* Null: Yes 72% missing
* Rec: Use to update Status and drop as missing percentage too high.
D4: onset_dt - Date of symptom onset
* Null: Yes 49% missing.
* Unique Values: 326
* Keep for determining time between reporting and symptom onset.
*Covers 2nd January 2020 to 28th January 2021.
Box plots were produced for all categorical data. These are present in the appendix due to the size of the file. All pairs of features and single-feature summaries were calculated as an initial exploration.
The steps provided in the assignment outline a broadly linear process; however, upon reviewing the data I did not believe the outlined process was particularly suitable for this dataset.
In particular, the processing steps outlined suggest the removal of duplicate values prior to data exploration. As I did not believe the records were, in fact, duplicates, but were instead driven by other elements, it was more reasonable to explore the relationships between various factors before taking any steps to drop overlapping rows, in order to better understand why they arise.
Similarly, the steps provided suggest not adding columns until the final section. Due to the nature of the data and the variety of missing values within some of the indicator and date columns, it seemed that valuable information could be obtained from my initial exploration before any final removal occurs. In particular, the onset datetime column looks to have key value in relation to the asymptomatic prevalence of COVID and the time between initial presentation and symptom onset. Therefore, adjusting this column and adding attributes which reflect the data in the original column, while preserving and enhancing the dataset, was a logical approach before simply dropping the feature for its missing-value prevalence. Similarly, the race column contains race and ethnicity combined; this can be replaced with the racial info alone, which is sufficient to capture the concatenated content. While the CDC may have a reporting need to compare Hispanic vs non-Hispanic demographics, stripping the redundant info reduces the memory usage of the field while still allowing recovery if this would be insightful.
Due to all of the above, the data quality plan and data quality actioning were, in a sense, completed as a joint process, as proper cleansing of the set did not allow for a fully linear process. This step is detailed below.
Based on the initial insights, the following is the data quality plan. Full details on reasoning have been already outlined in the data quality report.
A key note: the author wishes to avoid dropping data as an intermediate step unless it is necessary or the data is directly contradictory. The acquisition cost of data is too significant to justify dropping it until just prior to use in ML models, as retrieval can be challenging. As such, missing values are generally imputed. The Data Action Dictionary is:
data_action_dictionary=
{
'cdc_case_earliest_dt':
{
"Data Quality Issues": "515 Rows where not minimum of other dates populated"
,"Data Quality Actions": "Confirm reason. Otherwise leave as-is"
}
,'cdc_report_dt':
{
"Data Quality Issues": "Deprecated"
,"Data Quality Actions":"Drop"
}
,'pos_spec_dt':
{
"Data Quality Issues":"72% of data missing"
,"Data Quality Actions":"Drop after using for status correction"
}
,'onset_dt':
{
"Data Quality Issues":"49% of Data Missing. <1% of dates where onset_dt is too far after case date."
,"Data Quality Actions":"Split into days since symptom. Flag missing data. Drop column. Statistically relevant. Enquire on why some values are so extreme after earliest date"
}
,'current_status':
{
"Data Quality Issues": "Probable Cases that should be Laboratory Confirmed Cases"
,"Data Quality Actions":"Update instances"
}
,'sex':
{
"Data Quality Issues": "Missing and Unknown flags"
,"Data Quality Actions": "Bin into Unknown category"
}
,'age_group':
{
"Data Quality Issues": "Missing and Unknown flags"
,"Data Quality Actions":"Bin into groups"
}
,'race_ethnicity_combined':
{
"Data Quality Issues":"Concatenated field. Race sufficient to capture all info."
,"Data Quality Actions":"Split field and drop ethnicity"
}
,'hosp_yn':
{
"Data Quality Issues":"Missing, Unknown, and OTH values"
,"Data Quality Actions":"Bin unknown into groups"
}
,'icu_yn':
{
"Data Quality Issues":"Missing data 77%."
,"Data Quality Actions":"Determine if missing because 'no'. Column is relevant so await answer before dropping"
}
,'death_yn':
{
"Data Quality Issues":"Not applicable"
,"Data Quality Actions":"No action"
}
,'medcond_yn':
{
"Data Quality Issues":"80% missing"
,"Data Quality Actions":"Grouping missing consistently. Column is relevant so keep until answer on cause of missing values"
}
}
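A few of the actions in the dictionary above can be sketched directly in pandas. The frame below is a hypothetical stand-in with only the affected columns; the real cleaning operates on the full Homework1 dataset.

```python
import pandas as pd

# Toy frame with columns touched by the plan (hypothetical values).
df = pd.DataFrame({
    "cdc_report_dt": ["2020-05-01", "2020-05-02"],
    "sex": ["Missing", "Female"],
    "hosp_yn": ["OTH", "Yes"],
})

# Action for cdc_report_dt: drop the deprecated column.
df = df.drop(columns=["cdc_report_dt"])

# Action for sex / hosp_yn: bin the various missing/unknown flags
# into a single 'Unknown' level.
for col in ["sex", "hosp_yn"]:
    df[col] = df[col].replace({"Missing": "Unknown", "OTH": "Unknown",
                               "NA": "Unknown"})
print(df["sex"].tolist())  # ['Unknown', 'Female']
```

The remaining actions (status correction from pos_spec_dt, the days-since-symptom split for onset_dt) follow the same filter-and-assign pattern.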
I elected to pair and plot all combinations of features within the dataset.
To extend the set, I created day, month, year, and workday features from the cdc_case_earliest_dt date. This was primarily to help determine whether cases followed any trend in timing within the week, month, or year, which could be insightful. A confounding factor could be that certain areas operate on a rotating staff basis, in which case trends in deaths could point to further areas to investigate.
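These date-part features can be derived with the pandas `.dt` accessor; a sketch on two hypothetical dates (the derived column names here are illustrative, not necessarily those used in the notebook):

```python
import pandas as pd

df = pd.DataFrame({"cdc_case_earliest_dt":
                   pd.to_datetime(["2020-03-01", "2020-03-02"])})

dt = df["cdc_case_earliest_dt"].dt
df["case_day"] = dt.day
df["case_month"] = dt.month
df["case_year"] = dt.year
# weekday: Monday=0 ... Sunday=6, so weekday < 5 marks a working day.
df["case_workday"] = dt.weekday < 5
print(df["case_workday"].tolist())  # [False, True]
```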
Adding on to my earlier analysis and inspection, I changed the onset date into a column holding the number of days after diagnosis that symptoms appeared. My initial hypothesis is that individuals who were tested but did not become symptomatic until later would have had a better expected outcome due to earlier intervention and treatment management, and that this could have predictive power in determining whether a patient was at risk of dying.
Finally, I added flags for whether demographic or medical data was missing for a particular record. Although I personally wished to avoid removing duplicates until the data is fed into an ML model, it will be necessary to experiment with different features being present or absent given the high quantity of missing values within the dataset. These flags provide a convenient way to filter the dataset and focus on the rows where full data is present if needed.
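The days-to-onset column and missing-data flags described above can be sketched as follows, on a hypothetical two-row frame (the flag names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "cdc_case_earliest_dt": pd.to_datetime(["2020-05-10", "2020-05-12"]),
    "onset_dt": pd.to_datetime(["2020-05-13", pd.NaT]),
    "race_ethnicity_combined": ["White", "Unknown"],
})

# Days from the earliest case date until symptom onset (NaN when unknown).
df["days_to_onset"] = (df["onset_dt"] - df["cdc_case_earliest_dt"]).dt.days

# Convenience flags marking rows with missing onset/demographic info.
df["onset_missing"] = df["onset_dt"].isna()
df["race_missing"] = df["race_ethnicity_combined"].eq("Unknown")
print(df["days_to_onset"].tolist())  # [3.0, nan]
```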
For the purpose of analysing pairs of features, beyond some of the analysis already conducted, I am electing to focus on plotting the target feature death_yn versus the other categorical columns. Other features of interest may be briefly discussed; however, the primary focus will be on the death_yn feature against the others. Unfortunately, as the data is primarily categorical, the analysis focuses mainly on feature distributions.
Key Points:
death vs age highlights an increasing relative proportion of deaths among older patients, rising from a minor factor in the 30-39 years group to an increasing proportion as age increases, suggesting this is likely an indicator which should be factored into our model.
death vs hosp_yn and icu_yn: Among those with hosp_yn = Yes and/or icu_yn = Yes, the death rate is similarly significantly larger in relative likelihood compared to other attribute combinations.
month vs death: Interestingly, over time the proportion of deaths per month has decreased, with February particularly high in terms of the proportion of deaths, and this proportion decreases from March onwards.
medcond vs death: Interestingly, although the comorbidity Yes instances feature an elevated proportion of deaths, the increase was not as significant as I had expected.
race vs death: American Indian and Black individuals in the sample set are dying at a higher proportion than other races. Given that this dataset comes from the CDC and focuses on the American healthcare system, where wealth is a significant factor in quality of treatment and racial biases are likely to leave minorities at an economic disadvantage, this is potentially a contributing element.
day vs death: People whose CDC earliest date falls on the 24th or 25th day of the month are most likely to die over the course of treatment, although the impact of this is not as significant as other factors.
One area of note is that a 100% stacked bar chart diminishes the importance of how prevalent features actually are, and fails to account for how single instances have a larger impact on a smaller group. Due to this, the 100% stacked bar charts can give a highly misleading view of the data (although they are useful for gaining perspective on factors relevant to our model) and should be considered alongside the stacked (but not 100% stacked) bar charts produced throughout the report.
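The caveat above can be made concrete with two crosstabs: a row-normalised one (the basis of a 100% stacked chart) and a raw-count one. The toy data is hypothetical, chosen so that a small group shows a dramatic normalised rate:

```python
import pandas as pd

# Hypothetical sample: 90 young patients (1 death), 10 old (4 deaths).
df = pd.DataFrame({"age_group": ["20-29"] * 90 + ["80+"] * 10,
                   "death_yn": ["No"] * 89 + ["Yes"] + ["No"] * 6 + ["Yes"] * 4})

counts = pd.crosstab(df["age_group"], df["death_yn"])
rates = pd.crosstab(df["age_group"], df["death_yn"], normalize="index")
# The 80+ group shows a 40% death rate, but from only 10 rows --
# the raw counts are needed to judge how much weight that deserves.
print(rates.loc["80+", "Yes"], counts.loc["80+"].sum())
```

Reading the normalised table alongside the counts is exactly the cross-check the report recommends between the two chart styles.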
Homework 2:
Following the splitting of our dataset into train and test sets, we arrive at the below:
Probable cases, interestingly, are slightly more likely than laboratory-confirmed cases to have resulted in death. The suspicion is that this is due to retrospective classification of this data. Overall probable-case volume is small (around 7% of the data), so this is likely not a significant indicator for future data.
Males are more likely to be flagged as having died in the training dataset, so there is potentially a higher risk for males. A key factor may be that life expectancy for males in the US is lower than for females; particularly in the older categories, males might be at a more pronounced risk, with overall lower life expectancy resulting in greater susceptibility to COVID. Ultimately, the differences by sex are relatively minor, so this is unlikely to have a significant sway within our model, given that some features (e.g. age, icu, medcond, hosp) have a more significant correlation.
Age group is a highly significant factor. From 40+, there is a greatly increased likelihood of death. This aligns with what is currently known about COVID, where people in older categories are at a greatly increased risk of significant complications. Particularly in the 80+ category there is a very significant mortality rate increase, and as such this is likely to be a very important predictor of death in our dataset.
We see that hospitalisation is correlated with an increased likelihood of death within the set which is not too surprising given that those who are hospitalised are more likely to have a more serious presentation of COVID than those who do not require hospitalisation.
We see ICU admission has a very significant impact on whether somebody is flagged as having died. Similarly to hospitalisation, this is likely because those who require an ICU admission tend to have an extreme presentation of COVID, so being admitted to ICU is likely to be a very strong indicator of prognosis. The proportion of missing values aligns with those who were not admitted to the ICU. As described in Homework1, I strongly suspect that a missing ICU indicator in fact indicates the patient was not admitted to the ICU. The proportion of admissions supports this, and, as analysed in assignment 1, it is further supported by the ICU flag being most heavily missing for patients in younger age categories: likely, unless the patient was explicitly admitted, the field was left unchecked, resulting in a missing value, whereas an admitted patient is much more likely to have a value flagged.
Regardless, ICU is a clearly promising indicator.
People with a medical condition noted are at an elevated risk from COVID. While the correlation does not appear to be as strong as ICU admission or being in an older age category, it is likely to be a relevant factor.
We see that some races appear to be disproportionately impacted by COVID, but by and large the proportions are similar, except for some minority groups which are not heavily featured in the dataset. As there does not appear to be a significant correlation, we do not use this feature.
Based on the above, we elect to include the following predictive features as having the most relevance to our model:
Although it is not required, we should determine what a 'default' model would look like. As the supermajority of patients do not die from COVID, we want to see how a model which simply predicts that everybody will live performs.
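Such a majority-class baseline can be built with scikit-learn's DummyClassifier; a sketch on hypothetical labels mirroring the roughly 3% death rate in the sample:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with a 97/3 class split, mirroring the ~3% death rate.
y = np.array([0] * 97 + [1] * 3)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
# High accuracy, but recall on the 'death' class is zero.
print(accuracy_score(y, pred), recall_score(y, pred))  # 0.97 0.0
```

This is the bar every trained model must clear: accuracy alone looks excellent, while the baseline never identifies a single at-risk patient.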
In the cell below, we create a linear regression model and evaluate it using a number of metrics.
- (2.1) On the training set, train a linear regression model to predict the target feature, using only the descriptive features selected in exercise (1) above.
In the function that generates the model, we create a Linear Regression model using only the features already listed. We train the model using the training set, completing this requirement.
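The core of that function looks roughly like the sketch below: one-hot encode the selected categorical features, then fit. The toy frame and category values are hypothetical stand-ins for the real training set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training frame with hypothetical values for selected features.
train = pd.DataFrame({
    "age_group": ["20-29", "80+", "80+", "30-39"],
    "icu_yn": ["No", "Yes", "No", "No"],
    "death_yn": [0, 1, 1, 0],
})

# One-hot encode the categorical predictors (drop_first avoids
# perfectly collinear dummy columns), then fit on the training set.
X = pd.get_dummies(train[["age_group", "icu_yn"]], drop_first=True)
model = LinearRegression().fit(X, train["death_yn"])
print(dict(zip(X.columns, model.coef_.round(3))), round(model.intercept_, 3))
```

The real model uses the full list of selected features (age_group, hosp_yn, icu_yn, medcond_yn), encoded the same way.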
- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).
In the function that generates the model, we print out the features and the equation which is used. We see:

death_yn =
    ( 0.002270128169016561 · 'age_group_10 - 19 Years')
  + ( 0.00021675244052644926 · 'age_group_20 - 29 Years')
  + (-0.0006233744350718495 · 'age_group_30 - 39 Years')
  + (-0.0008676843355245704 · 'age_group_40 - 49 Years')
  + ( 0.0025145985771263604 · 'age_group_50 - 59 Years')
  + ( 0.03328400774759359 · 'age_group_60 - 69 Years')
  + ( 0.09617783507054159 · 'age_group_70 - 79 Years')
  + ( 0.2645860166988681 · 'age_group_80+ Years')
  + ( 0.10438765773694733 · 'age_group_Unknown')
  + (-3.3306690738754696e-16 · 'hosp_yn_OTH')
  + ( 0.012189583487253167 · 'hosp_yn_Unknown')
  + ( 0.1728663400937809 · 'hosp_yn_Yes')
  + ( 0.01778246632302322 · 'icu_yn_Unknown')
  + ( 0.26395229124215197 · 'icu_yn_Yes')
  + (-0.005134164961470882 · 'medcond_yn_Unknown')
  + ( 0.021699890131505622 · 'medcond_yn_Yes')
  + (-0.02191179249340093)
From this, we observe that the features with the highest positive coefficients are the older age categories (70+), hospitalisation status, and ICU status. People with these elements flagged are more likely to be predicted as dying from COVID, as the model weights these features heavily. As the threshold is 0.5, the intercept is -0.02, and the age, hospitalisation, ICU, and medical condition features are mutually exclusive within each group (in the sense that you can only fall into one age value, one ICU value, one hosp value, one medcond value), some combination of hosp_yn Yes, icu_yn Yes, and age 70+ is required for the model to predict death; otherwise the coefficients do not sum high enough to meet the 0.5 threshold.
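The threshold argument can be checked with simple arithmetic on the (rounded) coefficients printed in (2.2):

```python
# Rounded coefficients taken from the fitted model in (2.2).
coef = {"age_group_80+ Years": 0.2646, "hosp_yn_Yes": 0.1729,
        "icu_yn_Yes": 0.2640}
intercept = -0.0219

# An 80+ patient admitted to hospital and ICU crosses the 0.5 threshold...
score = (intercept + coef["age_group_80+ Years"]
         + coef["hosp_yn_Yes"] + coef["icu_yn_Yes"])
print(round(score, 4))  # 0.6796

# ...but without the ICU flag the same patient does not.
score_no_icu = score - coef["icu_yn_Yes"]
print(round(score_no_icu, 4))  # 0.4156
```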
As this is a linear regression model, each coefficient represents the unit change in the predicted outcome value given the presence of that feature. The intercept shifts the line up or down and corresponds to the baseline case where all dummy features are zero.
Discussing every feature individually seems unnecessary and overly verbose; the key coefficients have been highlighted and their relative weights and importance are clear.
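As a minimal, self-contained sketch of how such coefficients can be produced and printed (the data and feature names below are toy stand-ins, not the actual training set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the one-hot encoded training frame (names illustrative).
rng = np.random.RandomState(14395076)
X = pd.DataFrame(rng.randint(0, 2, size=(200, 3)),
                 columns=["age_group_80+ Years", "hosp_yn_Yes", "icu_yn_Yes"])
# Synthetic target built from assumed coefficients plus small noise.
y = (0.26 * X["age_group_80+ Years"] + 0.17 * X["hosp_yn_Yes"]
     + 0.26 * X["icu_yn_Yes"] + rng.normal(0, 0.05, size=200))

model = LinearRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.4f}")
print(f"intercept: {model.intercept_:.4f}")
```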
- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
In the generating function, I have produced predictions for the same training data used to fit the model, thresholded the predicted values at 0.5, printed the first ten results, and computed evaluation metrics.
This is not a sound evaluation technique, as the model is being evaluated on the same data it was trained on; these results therefore carry little weight or insight into how the model will perform on new data. Similarly, linear regression is not a classification model and is not really designed for a binary target like this.
At this stage, we see that the model has a high accuracy for predicting survival, but it is over-eager to classify patients as 'not death', producing many false negatives. While the results are slightly better than flagging everybody as surviving, the model at this stage appears quite poor. Particularly for COVID, we would prefer a model that is overly aggressive in classifying people as potential deaths, so that those patients receive priority treatment, rather than one that falsely classifies patients at significant risk as likely to survive.
These results are computed over the training set, so they are not a reliable indicator; the model will be examined properly on the held-out test data. As an initial baseline, however, we can expect a model whose accuracy is high only because the supermajority of cases are non-deaths: accurate overall, yet poor at predicting death.
The results of this are listed below:
As required, the first ten predictions on the training data:

      Actual  Predicted  PredictionClass  Diff
8396       0   0.003143                0     0
987        0   0.005441                0     0
7274       0  -0.010131                0     0
1000       0   0.036210                0     0
4848       0   0.036210                0     0
9819       0   0.002058                0     0
7109       0  -0.009263                0     0
3123       0  -0.009887                0     0
5279       1   0.196887                0     1
7752       0   0.002303                0     0
----REPORT----
MAE:  0.03366751458925632
MSE:  0.03366751458925632
RMSE: 0.18348709651977252
R2:   0.003572408363848978
----DETAIL----
Accuracy: 0.9663324854107437
Confusion matrix:
[[6445    4]
 [ 221   13]]
Classification report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      6449
           1       0.76      0.06      0.10       234

    accuracy                           0.97      6683
   macro avg       0.87      0.53      0.54      6683
weighted avg       0.96      0.97      0.95      6683
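The thresholding step described in (2.3) can be sketched as follows (the scores below are hypothetical, not actual model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical continuous outputs from a linear regression model.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 1])
y_score = np.array([0.003, 0.005, -0.010, 0.620, 0.036,
                    0.200, -0.009, 0.002, 0.010, 0.550])

y_pred = (y_score >= 0.5).astype(int)  # threshold the regression output at 0.5
print("Predicted classes:", y_pred)
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))
```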
- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
In the generating function, I have evaluated the model on the test set and kept these as the main results. Compared to the evaluation on the training data, the key shift is notably worse predictive power for deaths. While the model retains a high accuracy, this largely stems from it flagging far too many patients as 'not death'. This is heavily influenced by the small number of positive death instances in our dataset, which is strongly skewed towards non-death. The sampling of the full CDC dataset into groups of ten thousand likely has a significant impact in hampering the model's ability to correctly classify deaths. This suggests that on a new dataset, the model would likely perform poorly in correctly identifying patients at significant risk from COVID.
In the generating function I have run 5-fold cross-validation over the entire dataset. The cross-validation RMSE averaged 0.1569, vs. approx. 0.18 when predicting on the training set.
Based on our results, the macro-average F1 is approx. 0.59 on the test set, which suggests our predictions are only barely better than guessing, and only marginally better than the 0.49 achieved by flagging everybody as 'not death' (particularly since that baseline's RMSE is similar, at approx. 0.18). This is slightly higher than on the training set, but we observe that the model's precision has dropped on both classes.
Based on the creation and analysis of our linear regression model, we can conclude that it is a poor model. While its accuracy is high, this is driven by the model biasing heavily towards predicting 'not death' (incorrectly flagging people who did die as surviving), since non-deaths occupy the majority of our dataset. Linear regression is not meant for classification in this manner, so weak results are to be expected. We should not use this model.
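The 5-fold cross-validated RMSE discussed above can be computed along these lines (toy data; the real code operates on the prepared CDC features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary features and a linear target with noise (illustrative only).
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(500, 4)).astype(float)
y = X @ np.array([0.05, 0.10, 0.20, 0.25]) + rng.normal(0, 0.15, size=500)

# scoring returns negated MSE, so negate and take the square root for RMSE.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-scores)
print("Per-fold RMSE:", rmse_per_fold)
print("Mean CV RMSE:", rmse_per_fold.mean())
```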
- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.
As in linear regression, this is done.
- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).
death_yn = logistic( -6.1267 - 1.3955×[age_group_10 - 19 Years] - 1.7478×[age_group_20 - 29 Years] - 1.4379×[age_group_30 - 39 Years] - 0.5467×[age_group_40 - 49 Years] + 0.0837×[age_group_50 - 59 Years] + 1.3856×[age_group_60 - 69 Years] + 2.1476×[age_group_70 - 79 Years] + 3.2575×[age_group_80+ Years] + 0.8325×[age_group_Unknown] + 0.0×[hosp_yn_OTH] + 0.8415×[hosp_yn_Unknown] + 2.3673×[hosp_yn_Yes] + 0.4915×[icu_yn_Unknown] + 1.9132×[icu_yn_Yes] + 0.3089×[medcond_yn_Unknown] + 0.9468×[medcond_yn_Yes] )

(coefficients rounded to four decimal places)
where logistic(x) = 1/(1 + e^{-x}) is the standard logistic (sigmoid) function.
I.e. writing $x_f \in \{0,1\}$ for each one-hot feature (age group, hosp, ICU, medcond values), $\beta_f$ for its learned weight, and $\beta_0 = -6.12674106$ for the intercept, we have $\mathrm{P}(death\_yn=1 \mid x)=\frac{1}{1+e^{-(\beta_0+\sum_{f} \beta_f x_f)}}$
Each coefficient therefore represents a change in log-odds: exponentiating a coefficient gives the multiplicative change in the odds of death when that feature is present, so large coefficients correspond to dramatic changes in the predicted odds. To this end, we again see that the older age groups are weighted heavily by the model, and again it is sensitive to ICU admission and hospitalisation. The intercept shifts the curve and dictates the base case.
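To make the log-odds interpretation concrete, a short sketch exponentiating three of the coefficients listed above (values copied from the fitted model) gives the multiplicative change in odds:

```python
import numpy as np

# Three coefficients copied from the fitted logistic model above.
coefs = {
    "age_group_80+ Years": 3.2574857584006645,
    "hosp_yn_Yes": 2.3672678007865713,
    "icu_yn_Yes": 1.9132227077415447,
}
for name, c in coefs.items():
    # exp(coefficient) = factor by which the odds of death are multiplied
    print(f"{name}: odds of death multiplied by about {np.exp(c):.1f}")
```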
- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
Again this is taken care of in the function. The first ten rows were printed and classification measures were printed.
Looking at the evaluation on the training set, we see that the logistic regression model does a better job than linear regression at predicting the death_yn feature. While there is an increase in patients flagged as potentially dying, more true positives were correctly identified over the training set, and the macro average is significantly improved over the previous linear regression model. The model still under-flags deaths, which is a problem in a healthcare context where at-risk patients must be identified to receive care, but it is an improvement on what was achieved with a very simple regression model.
As required, the first ten predictions on the training data:

      Actual  Predicted  PredictionClass  Diff
8396       0          0                0     0
987        0          0                0     0
7274       0          0                0     0
1000       0          0                0     0
4848       0          0                0     0
9819       0          0                0     0
7109       0          0                0     0
3123       0          0                0     0
5279       1          0                0     1
7752       0          0                0     0

----REPORT----
MAE:  0.03321861439473291
MSE:  0.03321861439473291
RMSE: 0.1822597443066705
R2:   0.01685810958566436
----DETAIL----
Accuracy: 0.966781385605267
Confusion matrix:
[[6398   51]
 [ 171   63]]
Classification report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      6449
           1       0.55      0.27      0.36       234

    accuracy                           0.97      6683
   macro avg       0.76      0.63      0.67      6683
weighted avg       0.96      0.97      0.96      6683
- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, achieving a macro F1-score of 0.70 on the test data. This is slightly stronger than what is achieved over the training set, a good sign that the model generalises and is not overfit to the training data. In fact, almost all metrics on the test data are higher than those observed on the training data. This would warrant further investigation to understand why, and to ensure the comparatively strong results are a consequence of good generalisation. The RMSE is 0.175, while over 5-fold cross-validation it is 0.18, so the model is comparatively consistent. The model correctly identified 31 of the patients who died of COVID, a much stronger result than the linear regression example. Particularly in a healthcare setting, it is much more important that the model correctly flags patients who are at risk of dying than that it correctly classifies healthy patients, so long as the false positive rate is not so high as to overburden the healthcare system.
- (3.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.
This is done.
- (3.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard of the working of this model.
As with the other models, the generating function for this model has plotted the feature importances. We can see that the key features the model uses to weight its results are ICU admission, the over-80 age group, hospitalisation, the 70 - 79 age group, and the presence of prior medical conditions (being over 80 or hospitalised is weighted with almost three times the weight of the next most important feature).
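A minimal sketch of extracting and ranking feature importances from a fitted random forest (toy data; column names illustrative, not the real training frame):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
X = pd.DataFrame(rng.randint(0, 2, size=(300, 3)),
                 columns=["icu_yn_Yes", "age_group_80+ Years", "hosp_yn_Yes"])
# Outcome driven almost entirely by the ICU flag, so it should rank highest.
y = np.maximum(X["icu_yn_Yes"].values, (rng.rand(300) < 0.05).astype(int))

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
ranking = pd.Series(rf.feature_importances_,
                    index=X.columns).sort_values(ascending=False)
print(ranking)
```

The importances are normalised to sum to 1, so they rank features relative to one another rather than giving an absolute effect size.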
- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
This is done.
Based on the training-data results, we see the macro-average F1 score is 0.72 with a high accuracy of 97%. This is the strongest-performing model examined so far over my data sample, with performance comparable to the logistic regression model on the training set. As with the other models, the biggest challenge is accurately classifying deaths, but this model does better at it than all previous models on the training set.
- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.
In the generating function I have printed these results. We can observe that this model performs more strongly than the simple linear regression model, the everybody-lives model, and the logistic regression model, achieving a macro F1-score of 0.71 on the test data, an accuracy of 97%, and a 5-fold cross-validated accuracy of 98.3%. While performance on the training data is slightly worse than on the test data, the difference is negligible, suggesting the model is well generalised and not overfit. The RMSE is lower than the logistic model's and the accuracy on deaths is slightly higher. Based on these results, I would recommend the random forest model for use in a production setting.
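The 5-fold cross-validated accuracy quoted above can be obtained along these lines (toy data, not the CDC sample):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary features; the label follows the first column 90% of the time.
rng = np.random.RandomState(1)
X = rng.randint(0, 2, size=(500, 4))
y = (X[:, 0] * (rng.rand(500) < 0.9)).astype(int)

# 5-fold cross-validated accuracy for the classifier.
acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=1),
                      X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", acc)
print("Mean 5-fold accuracy:", acc.mean())
```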
- (5.1) Which model of the ones trained above performs better at predicting the target feature?
-Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.
In the cells below I have created some comparisons of each of the models that were built, having also compared the models against one another while describing their individual performance in the previous sections. Each of the models performed better than the simple everybody-lives model. Although the everybody-lives model may in some instances have a higher accuracy, that accuracy is driven by the majority class being patients who live; by construction it never predicts death. In the context of this assignment, that is an incredibly poor property, as patients at risk of dying from COVID would never be flagged and could receive a poor case outcome because doctors are unaware of the inner workings of the model.
Among the required models, the one with the best balance of true positives, false positives, and overall performance is the random forest, which just ekes out ahead of the logistic regression model. Overall, the XGBoost model, a non-mandatory addition to the assignment, is the best-performing model, although its performance is comparable to the random forest (not too surprising, as XGBoost's default operation uses gradient-boosted trees). The one downside of the XGBoost model is that its training time is significantly longer than the other models' due to the GridSearch used to pinpoint optimal hyperparameters.
We see in the graphs below that over the test dataset, the performance of logistic regression, random forest, and XGBoost is quite comparable, with relatively minor differences between the three. I would suggest adding AutoML as a further comparison model, as it is highly performant, incorporates ensembling and hyperparameter optimisation, and is well regarded as an easily implementable out-of-the-box ML kit from Google's team, but this is out of scope for the assignment.
Conclusion: Random Forest of the required models, XGBoost overall. Minor differences between Logistic, RF, and XGB. Avoid: Linear Regression, Simple.
This has been partly covered in the previous section and is outlined in the problem-scope section in the intro.
We are trying to predict whether a patient is likely to have a good (living) or bad (dying) prognosis based on a combination of demographic details and their patient history by creating ML models.
There are two key challenges with this:
To do this, we developed five models (simple, Linear Regression, Logistic Regression, Random Forest, and XGBoost), compared their performance over both the training set and the test set, and analysed the results for each model, paying particular attention to the number of deaths correctly predicted and the overall precision.
Using ICU status, age, medical-condition history, and hospitalisation status as features, we see that the Random Forest model is the best performing of the base models, and XGBoost is the overall strongest, though it was an additional model not mandatory for this assignment. The performance of Logistic Regression, Random Forest, and XGBoost was very comparable, with minor differences between the three, while Linear Regression performed poorly and the Simple model, although it had a high accuracy, was totally inappropriate for the context of the problem.
Yes, we could create a gender-specific model. We see that the gender distributions of death differ, so we could create one model for male patients and one for female or unknown patients, then call whichever model is relevant to the patient's gender. Alternatively, we could add additional features or include an age split. We could also use XGBoost, as I have already done.
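A minimal sketch of the gender-specific idea, training one model per 'sex' value and dispatching on it (toy data; column names hypothetical, not the cleaned CDC frame):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic patients: death is driven by the (hypothetical) age flag.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "sex": rng.choice(["Male", "Female"], size=400),
    "age_80_plus": rng.randint(0, 2, size=400),
})
df["death_yn"] = (df["age_80_plus"] * (rng.rand(400) < 0.8)).astype(int)

# One model per gender group.
models = {}
for sex, group in df.groupby("sex"):
    models[sex] = LogisticRegression().fit(group[["age_80_plus"]],
                                           group["death_yn"])

# Dispatch on the patient's gender when predicting.
patient = pd.DataFrame({"age_80_plus": [1]})
pred = models["Male"].predict(patient)[0]
print("Predicted death_yn:", pred)
```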
(this is a summary and recap: please see full conclusions below)
As part of testing extensions of our model, we have tried:
In all cases, the first two potential model improvements resulted in an overall inferior model, except for the extension to include all features for XGBoost, where we obtained similar performance.
Based on these results, and particularly given that the XGBoost model only attains performance on par with the original model, I suspect the key requirement for improving results further will be gathering additional data or, as demonstrated with XGBoost, developing a new model.
Ideally, we would also gather more granular patient data to develop more 'hard-hitting' features, such as history of pulmonary illness or cardiovascular risk indicators, which are significant for COVID patients. Based on the performance of the XGBoost model, I believe there is only negligible room for improvement using the current features outside of more sophisticated methods beyond the scope of this module; the best chance for improvement will come from additional data being used to train the model.
- (4.3) Take your best model trained and selected based on past data (ie your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings.
First, I read in the new file and determine that it has data quality errors. As a result, I copy my Assignment 1 cleaning code into a function. Because I am assuming duplicates are required as part of the prediction, I do not drop duplicate rows, unlike in my Assignment 1 submission; all other components are the same. I then train a new random forest model over all of the historic data file and use it to predict each row in the newly cleansed file.
(commentary and findings)
We see both the XGBoost and RF models perform worse on the new dataset compared to the test results. Random forest ended up slightly better in macro-average F1 score (although both were very close).
Based on the drop in sensitivity in predicting deaths on the new dataset, it is likely that time is an important factor in the outcome of COVID. This makes sense, as the original data includes cases from the beginning of the pandemic, including probable cases, when mortality was high due to a poor understanding of the disease and uncertainty in how to treat it. Over time, understanding of which factors are significant has greatly improved.
Due to this time effect, it is likely important to refresh the models and re-examine the features used to build them, probably accounting for the stage of the pandemic in which the diagnosis was made. This would likely lead to a higher degree of accuracy on the new dataset.
Ultimately, while neither model is optimal by any means, both are useful as indicators and better than guessing. In productionising these models, it would be important to make very clear that the result is only an indicator, as the models still fail to classify every death case.
Personally, because of the ethical considerations, I would advise that the model is not deployed unless the death classification can be significantly improved either by collecting vastly more data or by getting a better dataset to work with (particularly involving patient history or regional data).
####--------------------------------------
#00.Import Modules
####--------------------------------------
######---------BEGIN
# SUPPRESS DEPRECATION WARNINGS: Applicable to datetime_is_numeric=True
######--------END
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
######---------BEGIN
# ML
######--------END
import nltk as nl
import sklearn as sk
import matplotlib as mp
import xgboost as xg
import pymc3 as pymc
import sympy as sym
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score
from sklearn import metrics
from sklearn.tree import export_graphviz
import graphviz
from graphviz import Source
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
######---------BEGIN
# SQL
######--------END
import requests as rq
import sqlalchemy as sqla
#import pyodbc
#import cx_oracle as cx
######---------BEGIN
# GENERAL
######--------END
import pandas as pd
import datetime as dt
import numpy as np
import sys
import os
import json
import time
import socket
import traceback as tb
import platform
import json
import pprint
import pickle
######---------BEGIN
# DATA VIS
######--------END
import seaborn as sns
import matplotlib as mp
#from bokeh import *
#from dash import *
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.dates as mdates
# We are going to use the ggplot stylesheet
plt.style.use('ggplot')
#!jupyter nbextension enable hide_input_all/main
#!jupyter nbextension disable hide_input/main
#!jupyter nbextension disable codefolding/main
#!jupyter nbconvert --to=html 14395076_Adam_Ryan_HW1_COMP47350.ipynb
###-----Filepath and No----###
student_no='14395076'
original_filepath=".//covid19-cdc-{}.csv".format(student_no)
staging_filepath=".//02_staging_covid19-cdc-{}.csv".format(student_no)
cleansed_filepath=".//03_cleansed_covid19-cdc-{}.csv".format(student_no)
cleansed_pickle_filepath=".//03_cleansed_covid19-cdc-{}.pickle".format(student_no)
extended_filepath="00F_adf_covid_data_{}.csv".format(student_no)
extended_pickle_filepath="00F_adf_covid_data_{}.pickle".format(student_no)
multi_categorical_barplot_initial_fn=".//01_multi_categorical_barplot_cdc-{}.pdf".format(student_no)
single_categorical_barplot_initial_fn=".//01_single_categorical_barplot_cdc-{}.pdf".format(student_no)
dupe_single_categorical_barplot_initial_fn=".//02_dupe_single_categorical_barplot_cdc-{}.pdf".format(student_no)
dupe_multi_categorical_barplot_initial_fn=".//02_dupe_multi_categorical_barplot_cdc-{}.pdf".format(student_no)
multi_categorical_barplot_dedupe_fn=".//03_multi_categorical_barplot_after_dedupe_cdc-{}.pdf".format(student_no)
single_categorical_barplot_dedupe_fn=".//03_single_categorical_barplot_after_dedupe_cdc-{}.pdf".format(student_no)
single_categorical_barplot_dedupe_cleanse1_fn=".//04_single_categorical_barplot_after_dedupe_initCleanse_cdc-{}.pdf".format(student_no)
multi_categorical_barplot_dedupe_cleanse1_fn=".//04_multi_categorical_barplot_after_dedupe_initCleanse_cdc-{}.pdf".format(student_no)
single_categorical_barplot_dedupe_cleanseF_fn=".//05_single_categorical_barplot_after_dedupe_finalCleanse_cdc-{}.pdf".format(student_no)
multi_categorical_barplot_dedupe_cleanseF_fn=".//05_multi_categorical_barplot_after_dedupe_finalCleanse_cdc-{}.pdf".format(student_no)
single_categorical_barplot_adf_fn=".//06_adf_single_categorical_barplot_cdc-{}.pdf".format(student_no)
multi_categorical_barplot_adf_fn=".//06_adf_multi_categorical_barplot_cdc-{}.pdf".format(student_no)
stack_multi_categorical_barplot_adf_fn=".//07_adf_multi_categorical_stack_barplot_cdc-{}.pdf".format(student_no)
swarm_death_multi_categorical_barplot_adf_fn=".//08_adf_multi_categorical_swarm_barplot_cdc-{}.pdf".format(student_no)
###-----DATES-----###
#Dates for File Additions if needed
today_date=dt.datetime.now()
#DateTime objects
today_year=today_date.year
today_month=today_date.month
today_day=today_date.day
#Convert to ISO Standard for Filename
str_year=str(today_date.year)
#Month should have two digits
str_month=str(today_date.month)
if len(str_month)==1:
str_month="0{}".format(str_month)
#Day should have two digits
str_day=str(today_date.day)
if len(str_day)==1:
str_day="0{}".format(str_day)
str_today_date="{}-{}-{}".format(str_year,str_month,str_day)
###--------Column Features--------###
#This is the CDC Data Dictionary - Here for Reference
data_dictionary_per_cdc={
'cdc_case_earliest_dt':['The earlier of the Clinical Date (date related to the illness or specimen collection) or the Date Received by CDC','datetime']
,'cdc_report_dt':['Initial case report date to CDC. Deprecated, use new cdc_case_earliest_dt','datetime']
,'pos_spec_dt':['Date of first positive specimen collection','datetime']
,'onset_dt':['Symptom onset date, if symptomatic','datetime']
,'current_status':['Case Status: Laboratory-confirmed case; Probable case','category']
,'sex':['Sex: Male; Female; Unknown; Other','category']
,'age_group':['Age Group: 0 - 9 Years; 10 - 19 Years; 20 - 29 Years; 30 - 39 Years; 40 - 49 Years; 50 - 59 Years; 60 - 69 Years; 70 - 79 Years; 80 + Years','category']
,'race_ethnicity_combined':['Race and ethnicity (combined): Hispanic/Latino; American Indian / Alaska Native, Non-Hispanic; Asian, Non-Hispanic; Black, Non-Hispanic; Native Hawaiian / Other Pacific Islander, Non-Hispanic; White, Non-Hispanic; Multiple/Other, Non-Hispanic','category']
,'hosp_yn':['Hospitalization status','category']
,'icu_yn':['ICU admission status','category']
,'death_yn':['Death status','category']
,'medcond_yn':['Presence of underlying comorbidity or disease','category']
}
valid_sex_values=['Male',
'Female',
'Unknown',
'Other']
valid_age_groupings=['0 - 9 Years',
'10 - 19 Years',
'20 - 29 Years',
'30 - 39 Years',
'40 - 49 Years',
'50 - 59 Years',
'60 - 69 Years',
'70 - 79 Years',
'80 + Years']
valid_race_values=['Hispanic/Latino',
'American Indian / Alaska Native, Non-Hispanic',
'Asian, Non-Hispanic',
'Black, Non-Hispanic',
'Native Hawaiian / Other Pacific Islander, Non-Hispanic',
'White, Non-Hispanic',
'Multiple/Other, Non-Hispanic']
#### NUMPY SEED FOR SKLEARN
np.random.seed(int(student_no))
# I'm relisting the functions from Submission 1, as there's no point rewriting these functions when they may have use in this submission.
def ingest_orig_covid_data(fp,cdc_dictionary):
"""A function to read in CSV Data and Validate """
print("Inside ingest_orig_covid_data({},dictionary)".format(fp))
#Valid Filepath
if os.path.isfile(fp):
#read_csv - Do Not Let Pandas Manipulate the Data First - Auto-assign is more memory intensive.
raw_df=pd.read_csv(fp,dtype=str)
print(raw_df)
#row_column data
shape_of_df=raw_df.shape
row_count=shape_of_df[0]
column_count=shape_of_df[1]
#print info to user
row_column_print_statement='Your file contains: \n{} rows x {} columns.\n\n'
row_column_print_statement=row_column_print_statement.format(row_count,column_count)
print(row_column_print_statement)
header_statement='The following columns are present:\n'
#print the headers
for header in raw_df.columns:
header_statement+='"{}"\n'.format(header)
print(header_statement)
#check if the schema is correct
if set(raw_df.columns)==set(cdc_dictionary.keys()) and len(raw_df.columns)==len(cdc_dictionary.keys()):
print('The columns in this data sample match the CDCs schema')
else:
print('The columns in this data sample do not match the CDCs schema')
return raw_df
#Not Valid Filepath
else:
print("Invalid filepath - Correct the filepath and re-ingest")
return
def data_convert(df,types,columnlist,dt_format):
"""A function to convert all columns in a list into the appropriate type"""
print("Inside data_convert()")
###Check if empty
if df.empty==False:
##Check if datetime or other
if types=='datetime':
###Check if 0
if len(columnlist)>0:
print('Converting to {}'.format(types))
df[columnlist]=df[columnlist].apply(pd.to_datetime,format=dt_format,errors='ignore')
else:
print('No need to convert to: {}'.format(types))
###Numeric type
elif types=='category':
###Check if 0
if len(columnlist)>0:
print('Converting to {}'.format(types))
df[columnlist]=df[columnlist].astype('category')
###Nothing to convert
else:
print('No need to convert')
###Numeric type
elif types=='numeric':
###Check if 0
if len(columnlist)>0:
print('Converting to Numerical')
df[columnlist]=df[columnlist].apply(pd.to_numeric, errors='ignore')
else:
print('No need to convert')
elif types=='boolean':
###Check if 0
if len(columnlist)>0:
print('Converting to Boolean')
df[columnlist]=df[columnlist].astype(bool)
else:
print('No need to convert')
###Other type - e.g. Boolean, string - Dont do anything - force the above types.
else:
print('Unknown type')
###Empty data set
else:
print("Empty dataframe")
def missing_check(row):
"""Highlight rows with potential missing_values"""
#Configuration Values
col_to_check=10
default_colour = 'white'
flag_colour=''
high_flag_colour_val='red'
med_flag_colour_val='orange'
low_flag_colour_val='yellow'
val_to_check=0
#Row length valid
if len(row)>=col_to_check:
#
if row.values[col_to_check] == 'High':
flag_colour = high_flag_colour_val
elif row.values[col_to_check] == 'Medium':
flag_colour = med_flag_colour_val
elif row.values[col_to_check] == 'Low':
flag_colour = low_flag_colour_val
if flag_colour=='':
colour=default_colour
else:
colour=flag_colour
return ['background-color: {}'.format(colour)]*len(row.values)
else:
print('Row too short - Reconfigure Column Number')
return ['background-color: {}'.format(default_colour)]*len(row.values)
def dt_missing_check(row):
"""Highlight rows with potential missing_values"""
#Configuration Values
col_to_check=9
default_colour = 'white'
flag_colour=''
high_flag_colour_val='red'
med_flag_colour_val='orange'
low_flag_colour_val='yellow'
val_to_check=0
#Row length valid
if len(row)>=col_to_check:
#
if row.values[col_to_check] == 'High':
flag_colour = high_flag_colour_val
elif row.values[col_to_check] == 'Medium':
flag_colour = med_flag_colour_val
elif row.values[col_to_check] == 'Low':
flag_colour = low_flag_colour_val
if flag_colour=='':
colour=default_colour
else:
colour=flag_colour
return ['background-color: {}'.format(colour)]*len(row.values)
else:
print('Row too short - Reconfigure Column Number')
return ['background-color: {}'.format(default_colour)]*len(row.values)
def group_over_single_categories(df,categorical_columns,pdf_fn, save_fig=True):
"""A function to group over the categories"""
print("Inside group_over_single_categories()")
row_count=len(df)
grouping_type={}
timestamp_now=dt.datetime.timestamp(dt.datetime.now())
#Dataframe is not empty, and there are categorical columns to group over:
if df.empty==False and len(categorical_columns)>0:
with PdfPages(pdf_fn) as pp: #lab
column=''
#Let's go through the category column type
for column in categorical_columns:
#Separator
print('\n\n----------------------\n\n')
agg_df=df.groupby([column]).agg({df.columns[0]:"count"})
print(agg_df)
agg_df=agg_df.reset_index()
agg_df=agg_df.rename(columns={df.columns[0]:'Rows'})
#Note: value_counts would also work here; it is used below for the graph.
agg_df['% Frequency']=100*(agg_df['Rows']/row_count)
#Be explicit over what we're displaying
print('Grouping over {} results in:\n'.format(column))
#Display the result
display(agg_df)
#Graphing Section:
figure = (
df[column]
.value_counts(dropna=True, normalize=True)
.plot(kind='bar'
,title='Proportion of values for {}'.format(column)
, xlabel='Field Values'
, ylabel='Proportion of Values'
, figsize=(35,35)
)
)
#This grid style is from the sample Lab5 as I like how it looks
plt.ylim([0,1])
plt.grid(visible=True, which='major', color='#666666', linestyle='-')
plt.setp(figure.get_xticklabels(), ha="right", rotation=0)
plt.minorticks_on()
plt.grid(visible=True, which='minor', color='#999999', linestyle='-', alpha=0.2)
plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
plt.show()
grouping_type[column]=agg_df
if save_fig:
pp.savefig(figure.get_figure())
return grouping_type
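The normalised frequencies plotted above can be sketched in isolation. A minimal example with a hypothetical categorical column (the column name and values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy illustration of the normalised frequencies that
# group_over_single_categories computes and plots per column.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female', 'Unknown']})
freq = toy['sex'].value_counts(dropna=True, normalize=True)
# Normalised frequencies always sum to 1, which is why ylim([0,1]) is safe.
print(freq)
```

With normalize=True each bar is a proportion of the total rows, matching the '% Frequency' column divided by 100.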
def group_over_multi_categories(df,categorical_columns,pdf_fn,save_output=False,save_fig=False):
"""A function to group over all pairs of categories
Warning: This can be memory intensive, as we iterate over all ordered column pairs (~n^2 groupings), so only run this if your device is able!"""
print("Inside group_over_multi_categories()")
row_count=len(df)
grouping_type={}
timestamp_now=dt.datetime.timestamp(dt.datetime.now())
#Try run this
try:
#Dataframe is not empty, and there are categorical columns to group over:
if df.empty==False and len(categorical_columns)>0:
with PdfPages(pdf_fn) as pp:
column=''
second_column=''
#Let's go through the category column type
for column in categorical_columns:
#Second index, n^2
for second_column in categorical_columns:
multi_column=[column]
#Create a key to access - pipe delimited as columns contain _
grouping_key="{}|{}"
#No point in grouping the same column twice
if second_column!=column:
multi_column+=[second_column]
grouping_key=grouping_key.format(column,second_column)
#Separator
print('\n\n----------------------\n\n')
agg_df=df.groupby(multi_column).agg({df.columns[0]:"count"})
agg_df=agg_df.reset_index()
agg_df=agg_df.rename(columns={df.columns[0]:'Rows'})
agg_df['% Frequency']=100*(agg_df['Rows']/row_count)
#Be explicit over what we're displaying
print('Grouping over {} results in:\n'.format(grouping_key))
#Display the result
display(agg_df)
#Graph
figure = (
(df[multi_column]
.dropna()
.value_counts(normalize=True)
.reset_index()
.pivot_table(index=column,columns=second_column)
.fillna(0))[0]
.plot(kind='bar'
, stacked=True
, title='Proportion of values for {} vs {}'.format(second_column,column)
, xlabel='Field Values'
, ylabel='Proportion of Values'
, figsize=(35,35)
)
)
#This grid style is from the sample Lab5 as I like how it looks
plt.ylim([0,1])
plt.grid(visible=True, which='major', color='#666666', linestyle='-')
plt.setp(figure.get_xticklabels(), ha="right", rotation=0)
plt.minorticks_on()
plt.grid(visible=True, which='minor', color='#999999', linestyle='-', alpha=0.2)
plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
plt.show()
if save_fig:
pp.savefig(figure.get_figure())
#Only save if explicitly passed - This could kill your memory.
if save_output:
grouping_type[grouping_key]=agg_df
#Catch exceptions
except Exception as exc:
print("Function exception:\n")
#check exception is memory error
if isinstance(exc, MemoryError):
print("Sorry, your device is not able to run this function as you have hit a memory limit")
print(exc)
return grouping_type
def cat_missing_check_cleanse(row):
"""Highlight rows with potential missing_values"""
#Configuration Values
col_to_check=8
default_colour = 'green'
flag_colour=''
high_flag_colour_val='red'
med_flag_colour_val='orange'
low_flag_colour_val='yellow'
val_to_check=0
#Row length valid
if len(row)>=col_to_check:
#
if row.values[col_to_check] == 'High':
flag_colour = high_flag_colour_val
elif row.values[col_to_check] == 'Medium':
flag_colour = med_flag_colour_val
elif row.values[col_to_check] == 'Low':
flag_colour = low_flag_colour_val
if flag_colour=='':
colour=default_colour
else:
colour=flag_colour
return ['background-color: {}'.format(colour)]*len(row.values)
else:
print('Row too short - Reconfigure Column Number')
return ['background-color: {}'.format(default_colour)]*len(row.values)
def stacked_group_over_multi_categories(df,categorical_columns,pdf_fn,save_output=False,save_fig=False):
"""A function to group over all pairs of categories
Warning: This can be memory intensive, as we iterate over all ordered column pairs (~n^2 groupings), so only run this if your device is able!"""
print("Inside group_over_multi_categories()")
row_count=len(df)
grouping_type={}
timestamp_now=dt.datetime.timestamp(dt.datetime.now())
#Try run this
try:
#Dataframe is not empty, and there are categorical columns to group over:
if df.empty==False and len(categorical_columns)>0:
with PdfPages(pdf_fn) as pp:
column=''
second_column=''
#Let's go through the category column type
for column in categorical_columns:
#Second index, n^2
for second_column in categorical_columns:
multi_column=[column]
#Create a key to access - pipe delimited as columns contain _
grouping_key="{}|{}"
#No point in grouping the same column twice
if second_column!=column:
multi_column+=[second_column]
grouping_key=grouping_key.format(column,second_column)
#Separator
print('\n\n----------------------\n\n')
agg_df=df.groupby(multi_column).agg({df.columns[0]:"count"})
agg_df=agg_df.reset_index()
agg_df=agg_df.rename(columns={df.columns[0]:'Rows'})
agg_df['% Frequency']=100*(agg_df['Rows']/row_count)
#Be explicit over what we're displaying
print('Grouping over {} results in:\n'.format(grouping_key))
#Display the result
display(agg_df)
sagg_df=(
df
.groupby([column])
.agg({df.columns[0]:"count"})
.reset_index()
.rename(columns={df.columns[0]:'TotalRows'})
)
join_df=agg_df.merge(sagg_df,left_on=column,right_on=column,suffixes=('_subbed','_group'))
join_df['% Stacked']=join_df['Rows']/join_df['TotalRows']
figure=((join_df
.pivot_table(index=column,columns=second_column,values='% Stacked')
.fillna(0))
.plot(kind='bar'
, stacked=True
, title='Distribution of values for {} vs {}'.format(second_column,column)
, xlabel='Field Values'
, ylabel='Makeup of Values'
, figsize=(35,35)
))
#This grid style is from the sample Lab5 as I like how it looks
plt.ylim([0,1])
plt.grid(visible=True, which='major', color='#666666', linestyle='-')
plt.setp(figure.get_xticklabels(), ha="right", rotation=0)
plt.minorticks_on()
plt.grid(visible=True, which='minor', color='#999999', linestyle='-', alpha=0.2)
plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
plt.show()
if save_fig:
pp.savefig(figure.get_figure())
#Only save if explicitly passed - This could kill your memory.
if save_output:
grouping_type[grouping_key]=agg_df
#Catch exceptions
except Exception as exc:
print("Function exception:\n")
#check exception is memory error
if isinstance(exc, MemoryError):
print("Sorry, your device is not able to run this function as you have hit a memory limit")
print(exc)
return grouping_type
def stacked_group_over_target_categories(df,categorical_columns,pdf_fn,save_output=False,save_fig=False):
"""A function to group over all pairs of categories and the target death_yn
Warning: This can be memory intensive as we have so only run this if your device is able!"""
print("Inside group_over_multi_categories()")
row_count=len(df)
grouping_type={}
timestamp_now=dt.datetime.timestamp(dt.datetime.now())
#Try run this
try:
#Dataframe is not empty, and there are categorical columns to group over:
if df.empty==False and len(categorical_columns)>0:
with PdfPages(pdf_fn) as pp:
column=''
second_column=''
#Let's go through the category column type
for column in categorical_columns:
#Second index is fixed to the target column
for second_column in ['death_yn']:
multi_column=[column]
#Create a key to access - pipe delimited as columns contain _
grouping_key="{}|{}"
#No point in grouping the same column twice
if second_column!=column:
multi_column+=[second_column]
grouping_key=grouping_key.format(column,second_column)
#Separator
print('\n\n----------------------\n\n')
agg_df=df.groupby(multi_column).agg({df.columns[0]:"count"})
agg_df=agg_df.reset_index()
agg_df=agg_df.rename(columns={df.columns[0]:'Rows'})
agg_df['% Frequency']=100*(agg_df['Rows']/row_count)
#Be explicit over what we're displaying
print('Grouping over {} results in:\n'.format(grouping_key))
#Display the result
display(agg_df)
sagg_df=(
df
.groupby([column])
.agg({df.columns[0]:"count"})
.reset_index()
.rename(columns={df.columns[0]:'TotalRows'})
)
join_df=agg_df.merge(sagg_df,left_on=column,right_on=column,suffixes=('_subbed','_group'))
join_df['% Stacked']=join_df['Rows']/join_df['TotalRows']
figure=((join_df
.pivot_table(index=column,columns=second_column,values='% Stacked')
.fillna(0))
.plot(kind='bar'
, stacked=True
, title='Distribution of values for {} vs {}'.format(second_column,column)
, xlabel='Field Values'
, ylabel='Makeup of Values'
, figsize=(35,35)
))
#This grid style is from the sample Lab5 as I like how it looks
plt.ylim([0,1])
plt.grid(visible=True, which='major', color='#666666', linestyle='-')
plt.setp(figure.get_xticklabels(), ha="right", rotation=0)
plt.minorticks_on()
plt.grid(visible=True, which='minor', color='#999999', linestyle='-', alpha=0.2)
plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
plt.show()
if save_fig:
pp.savefig(figure.get_figure())
#Only save if explicitly passed - This could kill your memory.
if save_output:
grouping_type[grouping_key]=agg_df
#Catch exceptions
except Exception as exc:
print("Function exception:\n")
#check exception is memory error
if isinstance(exc, MemoryError):
print("Sorry, your device is not able to run this function as you have hit a memory limit")
print(exc)
return grouping_type
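The merge-based '% Stacked' computation above (Rows divided by TotalRows per category) can be cross-checked with pandas' crosstab. A small sketch with hypothetical column values (not real dataset rows):

```python
import pandas as pd

# Cross-check for the merge-based '% Stacked' computation:
# pd.crosstab with normalize='index' gives the within-category
# makeup of the target directly, one row per category value.
toy = pd.DataFrame({'age_group': ['0-9', '0-9', '80+', '80+'],
                    'death_yn': ['No', 'No', 'No', 'Yes']})
stacked = pd.crosstab(toy['age_group'], toy['death_yn'], normalize='index')
print(stacked)
```

Each row of the result sums to 1, exactly the property that makes the stacked bars fill the [0,1] axis.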
def group_by_column(df,groupby_columns,agg_dict):
"""A function to group by columns given and aggregate according to a dictionary.
Input: df, columns to group by, agg_dictionary
"""
print("inside group_by_column(df,{},{})".format(groupby_columns,agg_dict))
#Possible Errors
error_dictionary={0:'No Error'
,1:'The dataframe is empty'
,2:"The columns to group by is empty or not a list"
,3: 'The dictionary is empty'
,4: 'The dataframe does not contain the required columns'
,999: 'Uncaught exception'
}
#Set as empty
summary_df=pd.DataFrame()
required_columns=[]
error_code=0
try:
#Dictionary is non-empty
if len(agg_dict)>0 and type(agg_dict)==dict:
#df not empty
if len(df)>0:
#List and non-empty
if type(groupby_columns)==list and len(groupby_columns)>0:
required_columns=list(groupby_columns)+list(agg_dict.keys())
#Required columns found
if set(required_columns).issubset(set(df.columns)):
#begin groupby - note: not catching summary issues as they are plentiful
summary_df=(df
.groupby(groupby_columns)
.agg(agg_dict)
.reset_index()
)
#Required columns not found
else:
error_code=4
error_message=error_dictionary[error_code]
print(error_message)
#Not a list or empty
else:
error_code=2
error_message=error_dictionary[error_code]
print(error_message)
#df is empty
else:
error_code=1
error_message=error_dictionary[error_code]
print(error_message)
#empty Dictionary
else:
error_code=3
error_message=error_dictionary[error_code]
print(error_message)
except Exception as e:
error_code=999
print("Uncaught exception: {}".format(e))
return [error_code,summary_df]
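The validated groupby/agg pattern that group_by_column wraps can be sketched on a toy frame (the column names and values here are illustrative only):

```python
import pandas as pd

# Minimal illustration of the groupby/agg pattern that
# group_by_column wraps with its validation and error codes.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female'],
                    'age': [30, 40, 50]})
summary = toy.groupby(['sex']).agg({'age': 'mean'}).reset_index()
print(summary)
```

The agg dictionary maps each column to its aggregation, mirroring the `agg_dict` argument of the helper.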
def create_xgboost_model(fulldf,train_df,test_df,target_column,plot_comp=True,plot_tree=True, threshhold_class=0.5):
"""Create an xgboostmodel"""
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Parameter dictionary for hyperparameter tuning
model_parameters = {'nthread':[2],
'objective':['binary:logistic'],
'learning_rate': [.03, 0.05, .07],
'max_depth': [5, 6, 7, 8],
'min_child_weight': [4],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [500]}
#Create the XGBoost Classifier - Set loss function as logistic
xg_regression_model = xg.XGBClassifier(objective ='binary:logistic',verbosity = 0)
#Hypertune
grid = GridSearchCV(xg_regression_model, model_parameters)
grid.fit(X_train, y_train)
best_parameters=grid.best_params_
xg_regression_model = grid.best_estimator_
#Score the model
score=xg_regression_model.score(X_train,y_train)
print("Model Training Score: {}%".format(score*100))
###-BEGIN TESTING ON TRAIN DATA - in-sample evaluation is optimistic; included only because part 3 requires it
print("---------------")
print("---------------")
print("As required by part 3, we also predict on the data used to train the model (in-sample check)")
inv_lin_prediction=xg_regression_model.predict(X_train)
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Test Vs Predicted Result for Test Set Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Test")
#Add second plot for perdiction Class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Test")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Predicted']=inv_pred_vs_act_df['Predicted']
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
#Check the predictions
model_prediction = xg_regression_model.predict(X_test)
#Classify them
model_prediction_classified=np.where(model_prediction>=threshhold_class,1,0)
kfold = KFold(n_splits=5)
results = cross_val_score(xg_regression_model, X, y, cv=kfold)
print("Model Accuracy: {}".format(results * 100))
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
plt.plot(x_axis, model_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('xg_pred_vs_orig_covid.png')
plt.show()
#Note: save_model writes XGBoost's own format (not a Python pickle), despite the extension
filename = './xg_model_covid.pickle'
xg_regression_model.save_model(filename)
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':model_prediction
,'PredictionClass':model_prediction_classified})
pred_vs_act_df['Predicted']=pred_vs_act_df['Predicted']
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=model_prediction_classified, verbose=True)
result_dict={}
result_dict['Model']=xg_regression_model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Model_Coefficients']=None
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
#Note: for the classifier, cross_val_score returns accuracy; stored under the RMSE keys for a consistent interface
result_dict['CrossVal_RMSE_MEAN']=np.mean(results)
result_dict['CrossVal_RMSE_STD']=np.std(results)
print("Importance by Booster Plot")
xg.plot_importance(xg_regression_model.get_booster())
print("Importance by Weight:")
xg_regression_model.get_booster().get_score(importance_type='weight')
if plot_tree:
#Visualisations, sometimes not great
try:
#Tree Plot
print("The Tree Is:")
fig, ax = plt.subplots(figsize=(100, 100))
xg.plot_tree(xg_regression_model,num_trees=2,ax=ax)
plt.savefig('xg_tree_covid.png')
plt.show()
except Exception as e:
print(e)
return result_dict
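The probability-to-class step used throughout these model functions (np.where against the threshold) can be sketched on its own, with toy scores:

```python
import numpy as np

# The thresholding step used in the model functions above:
# scores at or above the threshold become class 1, the rest class 0.
scores = np.array([0.1, 0.49, 0.5, 0.9])
classes = np.where(scores >= 0.5, 1, 0)
print(classes)
```

Note the boundary behaviour: a score exactly equal to the threshold is assigned to class 1, because the comparison is `>=`.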
def get_randomised_data(df,test_size=0.3,random=False):
"""A Function to split a dataframe into a training and test set"""
if random:
train, test = train_test_split(df, test_size=test_size)
else:
train, test = train_test_split(df, test_size=test_size,random_state=14395076)
split_dict={}
split_dict['Train']=train
split_dict['Test']=test
return split_dict
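The reproducibility that the fixed random_state above provides can be illustrated without scikit-learn: shuffling with a seeded generator and slicing 70/30 gives the same partition on every run. A minimal numpy sketch (the index count of 10 is illustrative):

```python
import numpy as np

# Sketch of a reproducible 70/30 split: a fixed seed (analogous to
# random_state in train_test_split) always yields the same permutation,
# so the split can be verified and reproduced later.
rng = np.random.RandomState(14395076)
indices = rng.permutation(10)
cut = int(len(indices) * 0.7)
train_idx, test_idx = indices[:cut], indices[cut:]
print(train_idx, test_idx)
```

Every row lands in exactly one of the two partitions, which is the property the held-out test set relies on.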
def ingest_cleansed_pickle_covid_data(fp,verbose=False):
"""A function to read in CSV Data and Validate """
print("Inside ingest_cleansed_pickle_covid_data({})".format(fp))
try:
#Valid Filepath
if os.path.isfile(fp):
#read_pickle - the cleansed data was pickled to preserve the assigned dtypes
raw_df=pd.read_pickle(fp)
#row_column data
shape_of_df=raw_df.shape
row_count=shape_of_df[0]
column_count=shape_of_df[1]
#print info to user
row_column_print_statement='Your file contains: \n{} rows x {} columns.\n\n'
row_column_print_statement=row_column_print_statement.format(row_count,column_count)
if verbose:
print(row_column_print_statement)
header_statement='The following columns are present:\n'
#print the headers
for header in raw_df.columns:
header_statement+='"{}"\n'.format(header)
if verbose:
print(header_statement)
if "onset_present" in raw_df.columns:
raw_df['onset_present']=raw_df['onset_present'].astype("category")
if verbose:
print("The Column Types are:\n{}\n".format(raw_df.dtypes))
return raw_df
#Not Valid Filepath
else:
print("Invalid filepath - Correct the filepath and re-ingest")
return
except Exception as e:
print(e)
#Fallback: read the cleansed CSV instead
new_fp="00F_adf_covid_data_14395076.csv"
if os.path.isfile(new_fp):
df=pd.read_csv(new_fp,dtype=str)
for column in df.columns:
if column in ["days_until_onset"]:
df[column]=pd.to_numeric(df[column])
if column in ["cdc_case_earliest_dt"]:
df[column]=pd.to_datetime(df[column])
if column not in ["cdc_case_earliest_dt","days_until_onset"]:
df[column]=df[column].astype("category")
return df
else:
print("Missing required File")
return
def create_correlation_heatmap(df,pdf_fn,target_list=None,savefig=False):
"""A function to create a seaborn heatmap and save to file"""
correlation_df = df.corr()
if isinstance(target_list,list):
correlation_df=correlation_df[correlation_df.index.isin(target_list)]
plt.subplots(figsize=(50, 50))
sns.heatmap(correlation_df
, xticklabels=correlation_df.columns
, yticklabels=correlation_df.columns
, annot=True)
if savefig:
with PdfPages(pdf_fn) as pp:
pp.savefig(plt.gcf())
return correlation_df
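What df.corr() feeds into the heatmap is a symmetric matrix of pairwise Pearson correlations. A toy sketch with hypothetical numeric columns:

```python
import pandas as pd

# Illustration of the matrix create_correlation_heatmap plots:
# b is a positive multiple of a, c decreases as a increases.
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 8],
                    'c': [4, 3, 2, 1]})
corr = toy.corr()
print(corr)
```

Perfectly linear relationships give +1 or -1; the heatmap's annot=True prints these values inside each cell.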
#This function is used repeatedly to compute all metrics - This is from Lab 7
def model_metrics(testActualVal, predictions,verbose=True):
#regression-style and classification evaluation measures
MAE=metrics.mean_absolute_error(testActualVal, predictions)
MSE=metrics.mean_squared_error(testActualVal, predictions)
RMSE=metrics.mean_squared_error(testActualVal, predictions)**0.5
R2=metrics.r2_score(testActualVal, predictions)
try:
accuracy=metrics.accuracy_score(testActualVal, predictions)
except:
accuracy=None
pass
try:
confusion_matrix=metrics.confusion_matrix(testActualVal, predictions)
except:
confusion_matrix=None
pass
try:
classification_rep=metrics.classification_report(testActualVal, predictions,output_dict=True)
except:
classification_rep=None
pass
if verbose:
print("----REPORT----")
print("MAE: ", metrics.mean_absolute_error(testActualVal, predictions))
print("MSE: ", metrics.mean_squared_error(testActualVal, predictions))
print("RMSE: ", metrics.mean_squared_error(testActualVal, predictions)**0.5)
print("R2: ", metrics.r2_score(testActualVal, predictions))
print("----DETAIL----")
print("\n\nAccuracy: \n", accuracy)
print("\n\nConfusion matrix: \n", confusion_matrix)
print("\n\nClassification report:\n ", classification_rep)
result_dict={}
result_dict['RMSE']=RMSE
result_dict['MAE']=MAE
result_dict['MSE']=MSE
result_dict['R2']=R2
result_dict['Accuracy']=accuracy
result_dict['Confusion']=confusion_matrix
result_dict['ClassificationRep']=classification_rep
return result_dict
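For binary 0/1 labels the headline metrics above reduce to simple counts, which makes them easy to sanity-check by hand. A toy example (labels are illustrative):

```python
import numpy as np

# Hand-computed check of the headline metrics for binary labels:
# with 0/1 values, MAE equals the error rate and accuracy is 1 - MAE.
actual = np.array([1, 0, 1, 1])
predicted = np.array([1, 0, 0, 1])
mae = np.mean(np.abs(actual - predicted))   # 1 wrong out of 4
accuracy = np.mean(actual == predicted)
print(mae, accuracy)
```

This equivalence only holds for hard 0/1 predictions; once probabilities are passed in, MAE and accuracy diverge.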
def everyone_lives_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True):
"""Create a model where everybody lives"""
#Full Set
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Baseline: predict 0 (survival) for every test case
lin_prediction = np.array([0.0 for x in y_test])
#Classify them
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data - Everyione Lives")
plt.legend()
plt.savefig('./all_living_pred_vs_orig_covid.png')
plt.show()
#Save the Model
#filename = './live_model_covid.pickle'
#pickle.dump(lin_regression_model, open(filename, 'wb'))
coeff_statement="{}=0 + 0".format(target_column)
if verbose:
#Print details on the coefficients
print(coeff_statement)
###-BEGIN TESTING ON TRAIN DATA - in-sample evaluation is optimistic; included only because part 3 requires it
print("---------------")
print("---------------")
print("As required by part 3, we also predict on the data used to train the model (in-sample check)")
inv_lin_prediction=np.array([0.0 for x in y_train])
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Test Vs Predicted Result for Test Set Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Test")
#Add second plot for perdiction Class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Test")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Predicted']=inv_pred_vs_act_df['Predicted']
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Predicted']=pred_vs_act_df['Predicted']
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
result_dict={}
result_dict['Model']=0
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Model_Coefficients']=zip(X_train.columns,[0 for c in X_train.columns])
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_RMSE_MEAN']=None
result_dict['CrossVal_RMSE_STD']=None
return result_dict
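Why this "everyone lives" baseline matters: on an imbalanced outcome, predicting the majority class for every case scores exactly the majority-class prevalence. A toy sketch (the 90% survival rate is illustrative, not the dataset's):

```python
import numpy as np

# Why the 'everyone lives' baseline can look deceptively strong:
# predicting 0 for every case scores exactly the survival rate,
# so on a 90%-survival sample the baseline accuracy is 0.9.
y_test = np.array([0] * 9 + [1])      # hypothetical imbalanced outcome
baseline = np.zeros_like(y_test)
accuracy = np.mean(baseline == y_test)
print(accuracy)
```

Any real model must beat this prevalence figure, and the confusion matrix and per-class recall expose what plain accuracy hides.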
def create_linear_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True):
"""Create a linear model"""
#Full Set
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Create the Linear Regression
lin_regression_model = sk.linear_model.LinearRegression()
#Fit the data
lin_regression_model.fit(X_train,y_train)
#--Check the coefficients--##
coeff_statement="{} = ".format(target_column)
feature_statement="\n{}('{}' * {})"
for feature_index in range(len(X_train.columns)):
if feature_index==0:
coeff_statement+=feature_statement.format('',X_train.columns[feature_index]
,lin_regression_model.coef_[feature_index])
else:
coeff_statement+=feature_statement.format('+',X_train.columns[feature_index]
,lin_regression_model.coef_[feature_index])
coeff_statement+=" + ({})\n\n\n\n".format(lin_regression_model.intercept_)
if verbose:
#Print details on the coefficients
print(coeff_statement)
#--End Coefficient Check--##
###-BEGIN TESTING ON TRAIN DATA - in-sample evaluation is optimistic; included only because part 3 requires it
print("---------------")
print("---------------")
print("As required by part 3, we also predict on the data used to train the model (in-sample check)")
inv_lin_prediction=lin_regression_model.predict(X_train)
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Test Vs Predicted Result for Test Set Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Test")
#Add second plot for perdiction Class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Test")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Predicted']=inv_pred_vs_act_df['Predicted']
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
print("---------------")
print("---------------")
print(" PROPER RESULT FROM TEST DATA:")
#Check the predictions
lin_prediction = lin_regression_model.predict(X_test)
#Classify them
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Save the Model
filename = './lin_model_covid.pickle'
pickle.dump(lin_regression_model, open(filename, 'wb'))
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Predicted']=pred_vs_act_df['Predicted']
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
scores = -cross_val_score(sk.linear_model.LinearRegression(), X, y, scoring='neg_mean_squared_error', cv=5)
print(scores)
cv_rmse = scores**0.5
print("Avg RMSE score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev RMSE score over 5 folds: \n", np.std(cv_rmse))
result_dict={}
result_dict['Model']=lin_regression_model
result_dict['Model_Coefficients']=zip(X_train.columns,lin_regression_model.coef_)
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_RMSE_MEAN']=np.mean(cv_rmse)
result_dict['CrossVal_RMSE_STD']=np.std(cv_rmse)
return result_dict
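The coefficient statement printed by create_linear_model is just the linear predictor: each feature times its coefficient, summed, plus the intercept. A sketch with toy numbers (coefficients and feature values are illustrative):

```python
import numpy as np

# The equation printed by the coefficient statement above:
# prediction = sum(feature * coefficient) + intercept
coef = np.array([0.5, -0.25])
intercept = 0.1
x_row = np.array([2.0, 4.0])
prediction = x_row @ coef + intercept   # 2*0.5 + 4*(-0.25) + 0.1
print(prediction)
```

Thresholding this continuous output at threshhold_class is what turns the regression into a crude classifier, as done in the function.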
def create_logistic_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True):
"""Create a logistic model
Note: variable names say 'linear' only because that model was built first."""
#Full Set
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Create the Linear Regression
lin_regression_model = sk.linear_model.LogisticRegression()
#Fit the data
lin_regression_model.fit(X_train,y_train)
#--Check the coefficients--##
coeff_statement="{} = logistic(".format(target_column)
feature_statement="\n{}('{}' * {})"
for feature_index in range(len(X_train.columns)):
if feature_index==0:
coeff_statement+=feature_statement.format('',X_train.columns[feature_index]
,lin_regression_model.coef_[0][feature_index])
else:
coeff_statement+=feature_statement.format('+',X_train.columns[feature_index]
,lin_regression_model.coef_[0][feature_index])
coeff_statement+=" + ({}))\n\n\n\n".format(lin_regression_model.intercept_)
if verbose:
#Print details on the coefficients
print(coeff_statement)
#--End Coefficient Check--##
###-BEGIN TESTING ON TRAIN DATA - in-sample evaluation is optimistic; included only because part 3 requires it
print("---------------")
print("---------------")
print("As required by part 3, we also predict on the data used to train the model (in-sample check)")
inv_lin_prediction=lin_regression_model.predict(X_train)
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Test Vs Predicted Result for Test Set Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Test")
#Add second plot for perdiction Class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Test")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Predicted']=inv_pred_vs_act_df['Predicted']
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
print("---------------")
print("---------------")
print(" PROPER RESULT FROM TEST DATA:")
#Check the predictions
lin_prediction = lin_regression_model.predict(X_test)
#Classify them - Note: predict already returns classes for logistic regression, but thresholding has no impact
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("Log: The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction Class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('./log_pred_vs_orig_covid.png')
plt.show()
#Save the Model
filename = './log_model_covid.pickle'
pickle.dump(lin_regression_model, open(filename, 'wb'))
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
scores = -cross_val_score(sk.linear_model.LogisticRegression(), X, y, scoring='neg_mean_squared_error', cv=5)
print(scores)
cv_rmse = scores**0.5
print("Avg RMSE score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev RMSE score over 5 folds: \n", np.std(cv_rmse))
result_dict={}
result_dict['Model']=lin_regression_model
result_dict['Model_Coefficients']=zip(X_train.columns,lin_regression_model.coef_)
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_RMSE_MEAN']=np.mean(cv_rmse)
result_dict['CrossVal_RMSE_STD']=np.std(cv_rmse)
return result_dict
def create_RandomForest_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True):
"""Create a random forest model
Note: The variable names say linear but that's only because it was first built for a linear model. Observe it does initialise the correct model type."""
#Full Set
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Create the Random Forest classifier
lin_regression_model = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=14395076)
#Fit the data
lin_regression_model.fit(X_train,y_train)
#--End Coefficient Check--##
###-BEGIN TESTING ON TRAIN DATA - DO NOT USE THIS, INVALID BUT REQUIRED IN Q!!!!
print("---------------")
print("---------------")
print("As part of 3 we are meant to predict the data used to train the model (???)")
inv_lin_prediction=lin_regression_model.predict(X_train)
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result for the Training Set Is:")
plt.figure(figsize=(50,20))
#Each training example is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Train")
#Add second plot for prediction Class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Train")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
print("---------------")
print("---------------")
print(" PROPER RESULT FROM TEST DATA:")
#Check the predictions
lin_prediction = lin_regression_model.predict(X_test)
#Classify them - Note: Not strictly needed since the classifier already predicts classes, but kept as it has no impact
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("Random Forest: The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction Class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('./randomforest_pred_vs_orig_covid.png')
plt.show()
#Save the Model
filename = './randomforest_model_covid.pickle'
pickle.dump(lin_regression_model, open(filename, 'wb'))
feature_imp=pd.DataFrame({'feature': X_train.columns, 'importance': lin_regression_model.feature_importances_})
feature_imp=feature_imp.set_index('feature')
feature_imp=feature_imp.sort_values('importance', axis=0, ascending=False)
if plot_comp:
print("Random Forest: Feature Importance:")
plt.figure(figsize=(50,20))
#Plot from DF
feature_imp.plot(kind='barh')
#Title and legend for the importance plot
plt.title("Random Forest Feature Importance")
plt.legend()
plt.savefig('./randomforest_importance_covid.png')
plt.show()
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), X, y, scoring='accuracy', cv=5)
print(scores)
#Note: these are accuracy scores, so no square root is needed
cv_rmse = scores
print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))
result_dict={}
result_dict['Model']=lin_regression_model
result_dict['Model_Coefficients']=None#zip(X_train.columns,lin_regression_model.coef_)
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_Std']=np.std(cv_rmse)
result_dict['FeatureImportance']=feature_imp
return result_dict
def create_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True,mod_type=''):
"""A wrapper to call the create model function"""
if mod_type=='Live':
mod_result=everyone_lives_model(fulldf
,train_df
,test_df
,target_column
,plot_comp
, threshhold_class
, assess
, verbose)
return mod_result
elif mod_type=='Linear':
mod_result=create_linear_model(fulldf
,train_df
,test_df
,target_column
,plot_comp
, threshhold_class
, assess
, verbose)
return mod_result
elif mod_type=='Logistic':
mod_result=create_logistic_model(fulldf
,train_df
,test_df
,target_column
,plot_comp
, threshhold_class
, assess
, verbose)
return mod_result
elif mod_type=='Random Forest':
mod_result=create_RandomForest_model(fulldf
,train_df
,test_df
,target_column
,plot_comp
, threshhold_class
, assess
, verbose)
return mod_result
elif mod_type=="XGBoost":
mod_result=create_xgboost_model(fulldf
,train_df
,test_df
,target_column
,plot_comp
, threshhold_class)
return mod_result
else:
print("Sorry, I've not built that model type yet.")
def compare_models(model_report_list,keys):
"""A function to compare models and plot it"""
comparison_df = pd.concat(model_report_list, keys=keys)
comp_plot_df=comparison_df.reset_index()
comp_plot_df=comp_plot_df.rename(columns={"level_0":"Model","level_1":"Type"})
print("Comparison Dataframe is: \n")
display(comp_plot_df)
for column in comp_plot_df:
if column!="Model" and column!="Type":
fig = sns.catplot(x="Type" #catplot replaces the deprecated factorplot
,y=column
#,hue="Type"
,col="Model"
,data=comp_plot_df
,kind='bar')
fig.set_xlabels('Type')
fn='./model_comp_{}_covid.png'.format(column)
plt.savefig(fn)
return comp_plot_df
We saved the cleansed and extended file from Assignment 1 as a pickle, so the column-type assignments are already complete. Columns retain the same types they had at the end of Assignment 1.
We print some information on which columns are present, how much data we have, and the column types, to validate the load.
Recall that "onset_present", "cdc_case_earliest_day", "cdc_case_earliest_weekday", "cdc_case_earliest_month", "cdc_case_earliest_year", "demographic_missing" and "medical_missing" were features added at the end of Assignment 1. For the initial model creation, we stick with the original features and drop these; we return to some of them later in the assignment.
raw_df=ingest_cleansed_pickle_covid_data(fp=extended_pickle_filepath,verbose=True)
Inside ingest_cleansed_pickle_covid_data(00F_adf_covid_data_14395076.pickle)
Your file contains: 9548 rows x 17 columns.
The following columns are present: "cdc_case_earliest_dt" "current_status" "sex" "age_group" "hosp_yn" "icu_yn" "death_yn" "medcond_yn" "race" "days_until_onset" "onset_present" "cdc_case_earliest_day" "cdc_case_earliest_weekday" "cdc_case_earliest_month" "cdc_case_earliest_year" "demographic_missing" "medical_missing"
The Column Types are:
cdc_case_earliest_dt datetime64[ns]
current_status category
sex category
age_group category
hosp_yn category
icu_yn category
death_yn category
medcond_yn category
race category
days_until_onset float64
onset_present category
cdc_case_earliest_day category
cdc_case_earliest_weekday category
cdc_case_earliest_month category
cdc_case_earliest_year category
demographic_missing category
medical_missing category
dtype: object
This list contains the original features that were present in the raw data.
original_features=["cdc_case_earliest_dt"
,"current_status"
,"sex"
,"age_group"
,"hosp_yn"
,"icu_yn"
,"death_yn"
,"medcond_yn"
,"race"]
staging_df=raw_df[original_features]
categorical_columns=[c for c in staging_df.columns if hasattr(staging_df[c], 'cat')]
target_column='death_yn'
In this section we begin the Assignment 2 Material explicitly.
Most of this is a redundant replica of the analysis in Assignment 1 but directed at the training data specifically. We leverage our defined analytics functions above to allow for a concise analysis of the data which is quickly adapted to new datasets.
We will split the data 70-30 and analyse our training data.
test_train_split=get_randomised_data(df=staging_df,test_size=0.3)
train_data=test_train_split['Train']
test_data=test_train_split['Test']
We see the train and test data are split correctly (i.e. in the right proportions) and that their combined length matches the original dataset. We are now ready to start analysing our training data.
This corresponds with task 1.1
print("There are {} rows in the {} data".format(len(train_data),'train'))
print("There are {} rows in the {} data".format(len(test_data),'test'))
display(len(train_data)+len(test_data)==len(staging_df))
display(staging_df.head())
display(staging_df.describe().T)
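For reference, a minimal sketch of what a split helper such as `get_randomised_data` could look like (the real helper is defined earlier in the notebook; the function name and toy frame here are illustrative). `train_test_split` shuffles by default, and fixing `random_state` makes the split reproducible, as task 1.1 requires:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def get_randomised_data_sketch(df, test_size=0.3, random_state=14395076):
    """Shuffle the rows and split into 70% train / 30% test partitions.

    A fixed random_state makes the shuffle deterministic, so the same
    split can be reproduced when re-running the notebook.
    """
    train_df, test_df = train_test_split(
        df, test_size=test_size, shuffle=True, random_state=random_state)
    return {'Train': train_df, 'Test': test_df}

# Toy example: 10 rows split into 7 train / 3 test
demo = pd.DataFrame({'x': range(10)})
split = get_randomised_data_sketch(demo)
```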
There are 6683 rows in the train data There are 2865 rows in the test data
True
| cdc_case_earliest_dt | current_status | sex | age_group | hosp_yn | icu_yn | death_yn | medcond_yn | race | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09-30 | Laboratory-confirmed case | Female | 50 - 59 Years | Unknown | Unknown | No | Unknown | Unknown |
| 1 | 2020-04-16 | Laboratory-confirmed case | Male | 50 - 59 Years | Unknown | Unknown | No | Unknown | Unknown |
| 2 | 2020-09-22 | Laboratory-confirmed case | Female | 10 - 19 Years | No | No | No | No | White |
| 3 | 2020-10-30 | Laboratory-confirmed case | Female | 60 - 69 Years | No | Unknown | No | Unknown | Multiple/Other |
| 4 | 2020-12-17 | Laboratory-confirmed case | Male | 40 - 49 Years | Unknown | Unknown | No | Unknown | Unknown |
| count | unique | top | freq | first | last | |
|---|---|---|---|---|---|---|
| cdc_case_earliest_dt | 9548 | 325 | 2020-12-29 00:00:00 | 102 | 2020-01-02 | 2021-01-16 |
| current_status | 9548 | 2 | Laboratory-confirmed case | 9106 | NaT | NaT |
| sex | 9548 | 3 | Female | 5089 | NaT | NaT |
| age_group | 9548 | 10 | 20 - 29 Years | 1718 | NaT | NaT |
| hosp_yn | 9548 | 3 | No | 5156 | NaT | NaT |
| icu_yn | 9548 | 3 | Unknown | 8509 | NaT | NaT |
| death_yn | 9548 | 2 | No | 9216 | NaT | NaT |
| medcond_yn | 9548 | 3 | Unknown | 7791 | NaT | NaT |
| race | 9548 | 8 | Unknown | 3787 | NaT | NaT |
We plot all pairs of categorical features against the target column.
death_data=stacked_group_over_target_categories(df=train_data,categorical_columns=categorical_columns,pdf_fn='./Training_Target_Analysis.pdf',save_output=True,save_fig=True)
Inside group_over_multi_categories() ---------------------- Grouping over current_status|death_yn results in:
| current_status | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | Laboratory-confirmed case | No | 6168 | 92.293880 |
| 1 | Laboratory-confirmed case | Yes | 215 | 3.217118 |
| 2 | Probable Case | No | 281 | 4.204698 |
| 3 | Probable Case | Yes | 19 | 0.284303 |
---------------------- Grouping over sex|death_yn results in:
| sex | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | Female | No | 3412 | 51.054915 |
| 1 | Female | Yes | 120 | 1.795601 |
| 2 | Male | No | 2987 | 44.695496 |
| 3 | Male | Yes | 113 | 1.690857 |
| 4 | Unknown | No | 50 | 0.748167 |
| 5 | Unknown | Yes | 1 | 0.014963 |
---------------------- Grouping over age_group|death_yn results in:
| age_group | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | 0 - 9 Years | No | 309 | 4.623672 |
| 1 | 0 - 9 Years | Yes | 0 | 0.000000 |
| 2 | 10 - 19 Years | No | 703 | 10.519228 |
| 3 | 10 - 19 Years | Yes | 0 | 0.000000 |
| 4 | 20 - 29 Years | No | 1196 | 17.896154 |
| 5 | 20 - 29 Years | Yes | 0 | 0.000000 |
| 6 | 30 - 39 Years | No | 1060 | 15.861140 |
| 7 | 30 - 39 Years | Yes | 1 | 0.014963 |
| 8 | 40 - 49 Years | No | 975 | 14.589256 |
| 9 | 40 - 49 Years | Yes | 6 | 0.089780 |
| 10 | 50 - 59 Years | No | 931 | 13.930869 |
| 11 | 50 - 59 Years | Yes | 12 | 0.179560 |
| 12 | 60 - 69 Years | No | 666 | 9.965584 |
| 13 | 60 - 69 Years | Yes | 44 | 0.658387 |
| 14 | 70 - 79 Years | No | 378 | 5.656142 |
| 15 | 70 - 79 Years | Yes | 63 | 0.942690 |
| 16 | 80+ Years | No | 222 | 3.321861 |
| 17 | 80+ Years | Yes | 107 | 1.601077 |
| 18 | Unknown | No | 9 | 0.134670 |
| 19 | Unknown | Yes | 1 | 0.014963 |
---------------------- Grouping over hosp_yn|death_yn results in:
| hosp_yn | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | No | No | 3561 | 53.284453 |
| 1 | No | Yes | 33 | 0.493790 |
| 2 | OTH | No | 0 | 0.000000 |
| 3 | OTH | Yes | 0 | 0.000000 |
| 4 | Unknown | No | 2529 | 37.842286 |
| 5 | Unknown | Yes | 61 | 0.912764 |
| 6 | Yes | No | 359 | 5.371839 |
| 7 | Yes | Yes | 140 | 2.094868 |
---------------------- Grouping over icu_yn|death_yn results in:
| icu_yn | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | No | No | 627 | 9.382014 |
| 1 | No | Yes | 22 | 0.329193 |
| 2 | Unknown | No | 5793 | 86.682628 |
| 3 | Unknown | Yes | 184 | 2.753255 |
| 4 | Yes | No | 29 | 0.433937 |
| 5 | Yes | Yes | 28 | 0.418974 |
---------------------- Grouping over medcond_yn|death_yn results in:
| medcond_yn | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | No | No | 627 | 9.382014 |
| 1 | No | Yes | 6 | 0.089780 |
| 2 | Unknown | No | 5308 | 79.425408 |
| 3 | Unknown | Yes | 165 | 2.468951 |
| 4 | Yes | No | 514 | 7.691157 |
| 5 | Yes | Yes | 63 | 0.942690 |
---------------------- Grouping over race|death_yn results in:
| race | death_yn | Rows | % Frequency | |
|---|---|---|---|---|
| 0 | American Indian/Alaska Native | No | 41 | 0.613497 |
| 1 | American Indian/Alaska Native | Yes | 3 | 0.044890 |
| 2 | Asian | No | 165 | 2.468951 |
| 3 | Asian | Yes | 7 | 0.104743 |
| 4 | Black | No | 471 | 7.047733 |
| 5 | Black | Yes | 26 | 0.389047 |
| 6 | Hispanic/Latino | No | 625 | 9.352087 |
| 7 | Hispanic/Latino | Yes | 21 | 0.314230 |
| 8 | Multiple/Other | No | 351 | 5.252132 |
| 9 | Multiple/Other | Yes | 7 | 0.104743 |
| 10 | Native Hawaiian/Other Pacific Islander | No | 18 | 0.269340 |
| 11 | Native Hawaiian/Other Pacific Islander | Yes | 0 | 0.000000 |
| 12 | Unknown | No | 2606 | 38.994464 |
| 13 | Unknown | Yes | 45 | 0.673350 |
| 14 | White | No | 2172 | 32.500374 |
| 15 | White | Yes | 125 | 1.870417 |
Interestingly, probable cases are slightly more likely than laboratory-confirmed cases to have resulted in death. Our suspicion is that this is due to retrospective classification of the data. Overall probable-case volume is small (<5% of the data), so this is unlikely to be a significant indicator for future data.
Males are more likely to be flagged as having died in the training set, so there is potentially a higher risk for males. A key factor may be that life expectancy for males in the US is lower than for females; particularly in the older categories, males might be at more pronounced risk, with lower overall life expectancy translating into greater susceptibility to severe COVID. Ultimately, the differences by sex are relatively minor, so this feature is unlikely to carry much weight in our model given that other features (e.g. age, icu, medcond, hosp) show stronger correlations.
Age group is a highly significant factor. From 40+, there is a greatly increased likelihood of death. This aligns with what is currently known about COVID, where people in older categories are at greatly increased risk of serious complications. In the 80+ category in particular there is a very significant increase in mortality rate, so this is likely to be a very important predictor of death in our dataset.
We see that hospitalisation is correlated with an increased likelihood of death within the set which is not too surprising given that those who are hospitalised are more likely to have a more serious presentation of COVID than those who do not require hospitalisation.
We see ICU admission has a very significant impact on whether somebody is flagged as having died. As with hospitalisation, this is likely because those who require ICU admission have an extreme presentation of COVID, so ICU admission should be a very strong indicator of prognosis. The proportion of missing values aligns with those who were not admitted to the ICU. As described in HW1, I strongly suspect that a missing ICU indicator in fact means the patient was not admitted to the ICU. The admission proportions support this, as does the Assignment 1 finding that the ICU flag is most heavily missing for patients in younger age categories: unless the patient was explicitly admitted, the field was likely left unchecked (producing a missing value), whereas admitted patients are much more likely to have a value recorded.
Regardless, ICU is a clearly promising indicator.
People with a medical condition noted are at elevated risk from COVID. While the correlation does not appear as strong as ICU admission or being in an older age category, it is likely to be a relevant factor.
We see that some races appear to be disproportionately impacted by COVID, but by and large the proportions are similar, except for some minority groups which are not heavily represented in the dataset. As there does not appear to be a significant correlation, we do not use this feature.
Based on the above, we elect to include the following predictive features as having the most relevance to our model:
As we do not have continuous features in our data set, this concludes 2.2
predictive_features=["age_group"
,"hosp_yn"
,"icu_yn"
,"medcond_yn"]
keep_features=[target_column]+predictive_features
We encode the target feature to an int and then one hot encode our predictive features.
modelling_df=raw_df[keep_features].copy()  #copy() avoids SettingWithCopyWarning when assigning below
#Transform target feature
modelling_df['death_yn']=modelling_df['death_yn'].astype(str)
modelling_df.loc[(modelling_df['death_yn']=='Yes'),'death_yn']=1
modelling_df.loc[(modelling_df['death_yn']=='No'),'death_yn']=0
modelling_df['death_yn']=modelling_df['death_yn'].astype(int)
modelling_dummy_df=pd.get_dummies(modelling_df, columns=predictive_features, drop_first=True)
test_train_split=get_randomised_data(df=modelling_dummy_df,test_size=0.3)
train_data=test_train_split['Train']
test_data=test_train_split['Test']
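As a small illustration of what `pd.get_dummies(..., drop_first=True)` does to one of these categorical columns (a hypothetical mini-frame, not the real data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for one categorical column
mini = pd.DataFrame({'icu_yn': ['No', 'Unknown', 'Yes', 'No']})
encoded = pd.get_dummies(mini, columns=['icu_yn'], drop_first=True)

# 'No' (alphabetically first) becomes the implicit baseline, encoded
# as the all-zero row; only the other two levels get indicator columns.
print(list(encoded.columns))  # ['icu_yn_Unknown', 'icu_yn_Yes']
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for the linear and logistic models below.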
Now that we have dummified the data and converted the relevant columns into numeric values, we will create a correlation matrix and heatmap.
We can see that death_yn is moderately positively correlated with icu, 80+, and med cond in particular as we ascertained from our previous analysis. We suspect these factors will be weighed heavily in our model.
train_correlation_df=create_correlation_heatmap(df=train_data
,pdf_fn='./2_3_TrainingCorrelation_Heatmap.pdf'
,savefig=True)
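For reference, a minimal sketch of what a helper like `create_correlation_heatmap` might do internally (the real helper is defined earlier in the notebook; the function name and arguments here are illustrative, assuming seaborn):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def correlation_heatmap_sketch(df, pdf_fn=None):
    """Compute the pairwise correlation matrix and render it as a heatmap."""
    corr = df.corr()  # expects numeric (e.g. dummified) columns
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
    if pdf_fn:
        plt.savefig(pdf_fn)
    return corr

# Toy demo: two perfectly correlated columns
demo_corr = correlation_heatmap_sketch(pd.DataFrame({'a': [1, 2, 3, 4],
                                                     'b': [2, 4, 6, 8]}))
```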
Now that we have our target feature and training and test data, we need to create a Linear Regression model, a Logistic Regression model, and a Random Forest model.
I have created functions above which are generalised enough to create each of these models and report all of the metrics required for evaluation and for the tasks listed in the exercise. As such, the primary code remaining is to call these functions and analyse the results.
Although it is not explicitly required, it is useful to establish what a 'default' model looks like. Since the large majority of patients do not die from COVID, we first examine a baseline model that predicts that everybody lives.
all_live_model_result_dictionary=(
create_model(fulldf=modelling_dummy_df
,train_df=train_data
,test_df=test_data
,target_column=target_column
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True
,mod_type='Live')
)
The Original Vs Predicted Result Is:
death_yn=0 + 0 --------------- --------------- As part of 3 we are meant to predict the data used to train the model (???) The Original Test Vs Predicted Result for Test Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 8396 | 0 | 0.0 | 0 | 0 |
| 987 | 0 | 0.0 | 0 | 0 |
| 7274 | 0 | 0.0 | 0 | 0 |
| 1000 | 0 | 0.0 | 0 | 0 |
| 4848 | 0 | 0.0 | 0 | 0 |
| 9819 | 0 | 0.0 | 0 | 0 |
| 7109 | 0 | 0.0 | 0 | 0 |
| 3123 | 0 | 0.0 | 0 | 0 |
| 5279 | 1 | 0.0 | 0 | 1 |
| 7752 | 0 | 0.0 | 0 | 0 |
----REPORT----
MAE: 0.035014215172826574
MSE: 0.035014215172826574
RMSE: 0.18712085712936058
R2: -0.03628469530159695
----DETAIL----
Accuracy:
0.9649857848271735
Confusion matrix:
[[6449 0]
[ 234 0]]
Classification report:
{'0': {'precision': 0.9649857848271735, 'recall': 1.0, 'f1-score': 0.9821809320743223, 'support': 6449}, '1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 234}, 'accuracy': 0.9649857848271735, 'macro avg': {'precision': 0.4824928924135867, 'recall': 0.5, 'f1-score': 0.49109046603716117, 'support': 6683}, 'weighted avg': {'precision': 0.9311975649185159, 'recall': 0.9649857848271735, 'f1-score': 0.9477906375800247, 'support': 6683}}
---------------
---------------
----REPORT----
MAE: 0.03420593368237347
MSE: 0.03420593368237347
RMSE: 0.1849484622330596
R2: -0.035417419588001264
----DETAIL----
Accuracy:
0.9657940663176265
Confusion matrix:
[[2767 0]
[ 98 0]]
Classification report:
{'0': {'precision': 0.9657940663176265, 'recall': 1.0, 'f1-score': 0.9825994318181819, 'support': 2767}, '1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 98}, 'accuracy': 0.9657940663176265, 'macro avg': {'precision': 0.48289703315881327, 'recall': 0.5, 'f1-score': 0.49129971590909094, 'support': 2865}, 'weighted avg': {'precision': 0.932758178534336, 'recall': 0.9657940663176265, 'f1-score': 0.9489887008170713, 'support': 2865}}
/opt/miniconda3/envs/comp30830/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
alllive_class_rep=pd.DataFrame(all_live_model_result_dictionary['ClassificationRep'])
display(alllive_class_rep)
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.965794 | 0.0 | 0.965794 | 0.482897 | 0.932758 |
| recall | 1.000000 | 0.0 | 0.965794 | 0.500000 | 0.965794 |
| f1-score | 0.982599 | 0.0 | 0.965794 | 0.491300 | 0.948989 |
| support | 2767.000000 | 98.0 | 0.965794 | 2865.000000 | 2865.000000 |
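The baseline's headline accuracy is just the majority-class proportion, which we can verify directly (class counts taken from the test split above):

```python
import numpy as np

# Class counts from the 30% test split above: 2767 survivors, 98 deaths
y_test = np.array([0] * 2767 + [1] * 98)
baseline_pred = np.zeros_like(y_test)  # the 'everyone lives' model

accuracy = (baseline_pred == y_test).mean()
# accuracy == 2767 / 2865 ~= 0.9658, matching the report above: any
# trained model must beat this before its accuracy means anything.
```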
In the cell below, we create a linear regression model and evaluate it using a number of metrics.
- (2.1) On the training set, train a linear regression model to predict the target feature, using only the descriptive features selected in exercise (1) above.
In the function that generates the model, we create a Linear Regression model using only the features listed above. We train the model on the training set, completing this requirement.
- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).
In the function that generates the model, we print out the features and the equation which is used. We see:
death_yn =
('age_group_10 - 19 Years' * 0.002270128169016561)
+('age_group_20 - 29 Years' * 0.00021675244052644926)
+('age_group_30 - 39 Years' * -0.0006233744350718495)
+('age_group_40 - 49 Years' * -0.0008676843355245704)
+('age_group_50 - 59 Years' * 0.0025145985771263604)
+('age_group_60 - 69 Years' * 0.03328400774759359)
+('age_group_70 - 79 Years' * 0.09617783507054159)
+('age_group_80+ Years' * 0.2645860166988681)
+('age_group_Unknown' * 0.10438765773694733)
+('hosp_yn_OTH' * -3.3306690738754696e-16)
+('hosp_yn_Unknown' * 0.012189583487253167)
+('hosp_yn_Yes' * 0.1728663400937809)
+('icu_yn_Unknown' * 0.01778246632302322)
+('icu_yn_Yes' * 0.26395229124215197)
+('medcond_yn_Unknown' * -0.005134164961470882)
+('medcond_yn_Yes' * 0.021699890131505622) + (-0.02191179249340093)
From this, we observe that the features with the largest positive coefficients are the older age categories (70+), hospitalisation status, and ICU status; patients with these elements flagged are the ones the model pushes towards a death prediction. Because the threshold is 0.5 and the intercept is -0.02, and because the age, hospitalisation, ICU, and medical-condition dummies are mutually exclusive within each group (a patient falls into exactly one age band, one ICU value, one hosp value, one medcond value), some combination of hosp_yn Yes, icu_yn Yes, and an age of 70+ is required before the coefficients sum past the 0.5 threshold and the model predicts death.
As this is a linear regression model, each coefficient corresponds to the change in the predicted target value when that indicator feature is present, all else being equal. The intercept shifts the fitted line up or down and corresponds to the baseline case (the dropped dummy levels).
With regard to discussing each of these features individually, it seems unnecessary and overly verbose to run through the weighting of every one; the key aspects have been highlighted and the relative weights and importance are clear.
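To make the thresholding argument concrete, we can sum the learned coefficients for a hypothetical worst-case profile (coefficient values rounded from the equation printed above; the patient profile is illustrative):

```python
# Coefficients rounded from the fitted linear model above
intercept = -0.0219
coef = {
    'age_group_80+ Years': 0.2646,
    'hosp_yn_Yes': 0.1729,
    'icu_yn_Yes': 0.2640,
    'medcond_yn_Yes': 0.0217,
}

# Hypothetical patient: 80+, hospitalised, in ICU, with a medical condition
score = intercept + sum(coef.values())  # ~0.70, above the 0.5 threshold

# Removing either of the two largest terms (80+ or ICU) pulls the
# score below 0.5, so the model would no longer predict death.
```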
- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
In the generation function, I have 'predicted' the training data that was used to fit the model:
In the generated function I have thresholded the value and printed the first ten results.
In the generated function I have computed evaluation metrics.
This is not a 'good' evaluation technique, as the model has been trained on that same data; these results carry little weight or insight into how the model will perform on new data. Moreover, Linear Regression is not meant to be used in this manner: it is not a classification model and is not really designed for a binary target like this.
At this stage, we see that the model has high accuracy for predicting survival, but it is also over-eager to classify patients as 'not death'. While the results are slightly better than simply flagging everybody as surviving, the model at this stage appears quite poor. Particularly for COVID, we would prefer a model which is overly aggressive in classifying people as potential deaths, so that those patients receive priority treatment, rather than one which falsely classifies patients at significant risk as likely to survive.
These results are over the training set so not a reliable model to really look at, and these features will be examined more in the actual test data, however as an initial baseline we know that we are likely to receive a model which is accurate, but is accurate because it is poor at predicting death while the supermajority is not death leading to a high accuracy.
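A minimal sketch of this thresholding approach, assuming scikit-learn; the toy `X_train`/`y_train` here are hypothetical stand-ins for the real dummy-encoded features and `death_yn` target, not the homework data:

```python
# Sketch (not the actual create_model implementation): fit a linear
# regression, threshold its continuous output at 0.5, then score it
# with classification metrics. Toy data, hypothetical feature values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X_train = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 1]])
y_train = np.array([0, 0, 1, 0, 1, 0])

model = LinearRegression().fit(X_train, y_train)
raw = model.predict(X_train)           # continuous output; can fall outside [0, 1]
pred_class = (raw >= 0.5).astype(int)  # threshold at 0.5 to get a class label

print(pred_class)                      # -> [0 0 1 0 1 0]
print(accuracy_score(y_train, pred_class))
print(confusion_matrix(y_train, pred_class))
```

As noted, scoring on the same data used for fitting is optimistic; the same thresholding applies unchanged to test-set predictions.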
- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
In the generating function, I have evaluated the model on the test set and kept these as the main results. Compared to the evaluation on the training data, the key shift we observe is notably worse predictive power for deaths. While the model retains a high accuracy, this largely originates from it flagging far too many people as 'not death'. This is likely heavily influenced by the fact that the total number of positive death instances in our dataset is small, so the data skews heavily towards non-death. Splitting the full CDC dataset into samples of ten thousand is likely having a significant impact in hampering the model's ability to correctly classify deaths. This suggests that on new data the model would likely perform poorly at correctly classifying patients who are at significant risk from COVID.
In the generating function I have provided 5-fold cross-validation over the entire dataset. The cross-validated RMSE averages 0.157, versus approximately 0.18 when predicting on the training set.
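The cross-validated RMSE can be computed along these lines, assuming scikit-learn's `cross_val_score`; `X` and `y` are synthetic stand-ins for the full dummy-encoded features and the 0/1 target:

```python
# Sketch of 5-fold cross-validation for a regression-style RMSE.
# cross_val_score returns the *negated* MSE, so flip the sign before sqrt.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 3))        # stand-in for the dummy-encoded features
y = rng.integers(0, 2, 100)     # stand-in for the 0/1 death_yn target

neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring='neg_mean_squared_error')
rmse = np.sqrt(-neg_mse)
print("Avg RMSE over 5 folds:", rmse.mean())
print("Stddev RMSE over 5 folds:", rmse.std())
```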
Based on our results, the macro-average F1 is approximately 0.59 on the test set, which suggests our predictions are only marginally better than the 0.49 achieved by flagging everybody as 'not death' (particularly since that baseline's RMSE is similar, at around 0.18). This is slightly higher than on the training set, but precision on the death class has dropped (from 0.76 to 0.63).
Based on the creation and analysis of our linear regression model, we can conclude that it is a poor model. While its accuracy is high, this is driven by the model biasing heavily towards 'not death' (incorrectly flagging people who died as survivors), simply because 'not death' makes up the majority of our dataset. Linear regression is not meant for classification in this manner, so weak results are to be expected. We should not use this model.
linear_model_result_dictionary = create_model(
    fulldf=modelling_dummy_df,
    train_df=train_data,
    test_df=test_data,
    target_column=target_column,
    plot_comp=True,
    threshhold_class=0.5,
    assess=True,
    verbose=True,
    mod_type='Linear')
death_yn =
('age_group_10 - 19 Years' * 0.002270128169016561)
+('age_group_20 - 29 Years' * 0.00021675244052644926)
+('age_group_30 - 39 Years' * -0.0006233744350718495)
+('age_group_40 - 49 Years' * -0.0008676843355245704)
+('age_group_50 - 59 Years' * 0.0025145985771263604)
+('age_group_60 - 69 Years' * 0.03328400774759359)
+('age_group_70 - 79 Years' * 0.09617783507054159)
+('age_group_80+ Years' * 0.2645860166988681)
+('age_group_Unknown' * 0.10438765773694733)
+('hosp_yn_OTH' * -3.3306690738754696e-16)
+('hosp_yn_Unknown' * 0.012189583487253167)
+('hosp_yn_Yes' * 0.1728663400937809)
+('icu_yn_Unknown' * 0.01778246632302322)
+('icu_yn_Yes' * 0.26395229124215197)
+('medcond_yn_Unknown' * -0.005134164961470882)
+('medcond_yn_Yes' * 0.021699890131505622) + (-0.02191179249340093)
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Test Vs Predicted Result for Test Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 8396 | 0 | 0.003143 | 0 | 0 |
| 987 | 0 | 0.005441 | 0 | 0 |
| 7274 | 0 | -0.010131 | 0 | 0 |
| 1000 | 0 | 0.036210 | 0 | 0 |
| 4848 | 0 | 0.036210 | 0 | 0 |
| 9819 | 0 | 0.002058 | 0 | 0 |
| 7109 | 0 | -0.009263 | 0 | 0 |
| 3123 | 0 | -0.009887 | 0 | 0 |
| 5279 | 1 | 0.196887 | 0 | 1 |
| 7752 | 0 | 0.002303 | 0 | 0 |
----REPORT----
MAE: 0.03366751458925632
MSE: 0.03366751458925632
RMSE: 0.18348709651977252
R2: 0.003572408363848978
----DETAIL----
Accuracy:
0.9663324854107437
Confusion matrix:
[[6445 4]
[ 221 13]]
Classification report:
{'0': {'precision': 0.9668466846684668, 'recall': 0.9993797487982633, 'f1-score': 0.9828440716736561, 'support': 6449}, '1': {'precision': 0.7647058823529411, 'recall': 0.05555555555555555, 'f1-score': 0.10358565737051792, 'support': 234}, 'accuracy': 0.9663324854107437, 'macro avg': {'precision': 0.8657762835107039, 'recall': 0.5274676521769094, 'f1-score': 0.543214864522087, 'support': 6683}, 'weighted avg': {'precision': 0.9597688831209831, 'recall': 0.9663324854107437, 'f1-score': 0.9520575283627276, 'support': 6683}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.032460732984293195
MSE: 0.032460732984293195
RMSE: 0.1801686237508995
R2: 0.017409999778733587
----DETAIL----
Accuracy:
0.9675392670157068
Confusion matrix:
[[2760 7]
[ 86 12]]
Classification report:
{'0': {'precision': 0.9697821503865074, 'recall': 0.9974701843151428, 'f1-score': 0.9834313201496525, 'support': 2767}, '1': {'precision': 0.631578947368421, 'recall': 0.12244897959183673, 'f1-score': 0.20512820512820512, 'support': 98}, 'accuracy': 0.9675392670157068, 'macro avg': {'precision': 0.8006805488774642, 'recall': 0.5599595819534897, 'f1-score': 0.5942797626389288, 'support': 2865}, 'weighted avg': {'precision': 0.9582135940529045, 'recall': 0.9675392670157068, 'f1-score': 0.9568087354124443, 'support': 2865}}
[0.02223966 0.02428779 0.02660176 0.02447963 0.02556203]
Avg RMSE score over 5 folds:
0.15688330546365153
Stddev RMSE score over 5 folds:
0.004669399838315644
linreg_class_rep=pd.DataFrame(linear_model_result_dictionary['ClassificationRep'])
display(linreg_class_rep)
print("The first 10 results predicted on the test set:")
display(linear_model_result_dictionary['Actual vs Prediction'].head(10))
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.969782 | 0.631579 | 0.967539 | 0.800681 | 0.958214 |
| recall | 0.997470 | 0.122449 | 0.967539 | 0.559960 | 0.967539 |
| f1-score | 0.983431 | 0.205128 | 0.967539 | 0.594280 | 0.956809 |
| support | 2767.000000 | 98.000000 | 0.967539 | 2865.000000 | 2865.000000 |
The first 10 results predicted on the test set:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 6757 | 0 | 0.036210 | 0 | 0 |
| 5603 | 0 | 0.002058 | 0 | 0 |
| 2081 | 0 | -0.006749 | 0 | 0 |
| 476 | 0 | -0.009887 | 0 | 0 |
| 6203 | 0 | -0.021912 | 0 | 0 |
| 1671 | 0 | -0.009263 | 0 | 0 |
| 809 | 0 | -0.010131 | 0 | 0 |
| 9885 | 0 | 0.005441 | 0 | 0 |
| 956 | 0 | 0.003143 | 0 | 0 |
| 6200 | 0 | -0.009047 | 0 | 0 |
We explicitly look at the test set's actual versus predicted deaths to see the true-positive counts for both classes.
print("We got {:.2f}% correct".format(100*len(
linear_model_result_dictionary['Actual vs Prediction'][linear_model_result_dictionary['Actual vs Prediction']['Diff']==0])/len(
linear_model_result_dictionary['Actual vs Prediction'])))
print("We correctly predicted {} out of {} deaths.".format(
len(linear_model_result_dictionary['Actual vs Prediction'][(linear_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(linear_model_result_dictionary['Actual vs Prediction']['Actual']==1)]),len(linear_model_result_dictionary['Actual vs Prediction'][(linear_model_result_dictionary['Actual vs Prediction']['Actual']==1)])))
print("We correctly predicted {} out of {} lives.".format(
len(linear_model_result_dictionary['Actual vs Prediction'][(linear_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(linear_model_result_dictionary['Actual vs Prediction']['Actual']==0)]),len(linear_model_result_dictionary['Actual vs Prediction'][(linear_model_result_dictionary['Actual vs Prediction']['Actual']==0)])))
We got 96.75% correct We correctly predicted 12 out of 98 deaths. We correctly predicted 2760 out of 2767 lives.
- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.
As with the linear regression model, this is handled by the generating function.
- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).
death_yn = logistic( ('age_group_10 - 19 Years' * -1.3954739647453331) +('age_group_20 - 29 Years' * -1.7477619209329482) +('age_group_30 - 39 Years' * -1.437896043371963) +('age_group_40 - 49 Years' * -0.5467412391146604) +('age_group_50 - 59 Years' * 0.08372144064151235) +('age_group_60 - 69 Years' * 1.385612326921969) +('age_group_70 - 79 Years' * 2.1476304546024774) +('age_group_80+ Years' * 3.2574857584006645) +('age_group_Unknown' * 0.8325318957143447) +('hosp_yn_OTH' * 0.0) +('hosp_yn_Unknown' * 0.8415435273812718) +('hosp_yn_Yes' * 2.3672678007865713) +('icu_yn_Unknown' * 0.4915252721992774) +('icu_yn_Yes' * 1.9132227077415447) +('medcond_yn_Unknown' * 0.3089176968434786) +('medcond_yn_Yes' * 0.9468181138631563) + (-6.12674106) )
where logistic(x) is the standard logistic (sigmoid) function, $\frac{1}{1+e^{-x}}$.
I.e. for F={(age features, age weighting),(hosp features, hosp weighting),(icu features, icu weighting), (med_cond features, med_con weighting)} we have $ \mathrm{P}(death\_yn=1|F)=\frac{1}{1 + e ^ {-({-6.12674106 + \sum\limits_{f=(f_1,f_2) \in F} f_2 \, f_1})}}$
Each coefficient therefore represents the change in the log-odds of death when that feature is present; because the effect on the odds is multiplicative, larger coefficients correspond to much larger shifts in predicted risk. We again see that the older age groups are weighted heavily by the model, and again it is sensitive to ICU admission and hospitalisation. The intercept shifts the curve and dictates the base case.
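To make the log-odds reading concrete, the coefficients can be exponentiated into odds ratios; a sketch assuming scikit-learn, with toy features (`feat_a`, `feat_b` are hypothetical names) standing in for the real dummy columns:

```python
# Sketch: exp(coefficient) gives the multiplicative change in the odds of
# death_yn=1 when that indicator is present. E.g. the age_group_80+ Years
# coefficient of ~3.26 above implies the odds scale by e^3.26 ~ 26x.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
odds_ratios = np.exp(model.coef_[0])
for name, oratio in zip(["feat_a", "feat_b"], odds_ratios):
    print(f"{name}: odds ratio {oratio:.2f}")
```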
- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
Again this is taken care of in the generating function: the first ten rows and the classification measures were printed.
Looking at the evaluation on the training set, we see that compared to the linear regression model, the logistic regression model does a better job at predicting the death_yn feature. While more patients are flagged as potentially dying, more true positives were correctly identified over the training set, and the macro average is significantly improved over the previous linear regression model. The model is still under-flagging deaths, which is a problem given that at-risk people must be identified and given care in a healthcare context, but it is an improvement on what a very simple regression model achieved.
- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.
In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, achieving a macro F1-score of 0.70 on the test data. This is slightly stronger than what is achieved over the training set (0.67), a good sign that the model generalises and is not overfit to the training data. In fact, almost all metrics on the test data are higher than those observed on the training data; this would warrant further investigation to confirm that the comparatively strong results are a consequence of good generalisation. We see the RMSE is 0.175, versus 0.18 over 5-fold cross-validation, with the model comparatively consistent across folds. The model correctly identified 31 of the patients who died of COVID, a much stronger result than the linear regression example. Particularly in a healthcare setting, it is more important that the model correctly flags patients who are at risk of dying than that it correctly classifies healthy patients, so long as the false-positive rate is not so high as to overburden the healthcare system.
logistic_model_result_dictionary = create_model(
    fulldf=modelling_dummy_df,
    train_df=train_data,
    test_df=test_data,
    target_column=target_column,
    plot_comp=True,
    threshhold_class=0.5,
    assess=True,
    verbose=True,
    mod_type='Logistic')
death_yn = logistic(
('age_group_10 - 19 Years' * -1.3954739647453331)
+('age_group_20 - 29 Years' * -1.7477619209329482)
+('age_group_30 - 39 Years' * -1.437896043371963)
+('age_group_40 - 49 Years' * -0.5467412391146604)
+('age_group_50 - 59 Years' * 0.08372144064151235)
+('age_group_60 - 69 Years' * 1.385612326921969)
+('age_group_70 - 79 Years' * 2.1476304546024774)
+('age_group_80+ Years' * 3.2574857584006645)
+('age_group_Unknown' * 0.8325318957143447)
+('hosp_yn_OTH' * 0.0)
+('hosp_yn_Unknown' * 0.8415435273812718)
+('hosp_yn_Yes' * 2.3672678007865713)
+('icu_yn_Unknown' * 0.4915252721992774)
+('icu_yn_Yes' * 1.9132227077415447)
+('medcond_yn_Unknown' * 0.3089176968434786)
+('medcond_yn_Yes' * 0.9468181138631563) + ([-6.12674106]))
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Test Vs Predicted Result for Test Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 0 | 0 | 1 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.03321861439473291
MSE: 0.03321861439473291
RMSE: 0.1822597443066705
R2: 0.01685810958566436
----DETAIL----
Accuracy:
0.966781385605267
Confusion matrix:
[[6398 51]
[ 171 63]]
Classification report:
{'0': {'precision': 0.9739686405845639, 'recall': 0.992091797177857, 'f1-score': 0.9829466891995698, 'support': 6449}, '1': {'precision': 0.5526315789473685, 'recall': 0.2692307692307692, 'f1-score': 0.3620689655172414, 'support': 234}, 'accuracy': 0.966781385605267, 'macro avg': {'precision': 0.7633001097659662, 'recall': 0.6306612832043131, 'f1-score': 0.6725078273584056, 'support': 6683}, 'weighted avg': {'precision': 0.9592158540481126, 'recall': 0.966781385605267, 'f1-score': 0.961207142986542, 'support': 6683}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Log: The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.030715532286212915
MSE: 0.030715532286212915
RMSE: 0.17525847279436424
R2: 0.07023741914546833
----DETAIL----
Accuracy:
0.9692844677137871
Confusion matrix:
[[2746 21]
[ 67 31]]
Classification report:
{'0': {'precision': 0.9761820120867402, 'recall': 0.9924105529454282, 'f1-score': 0.9842293906810036, 'support': 2767}, '1': {'precision': 0.5961538461538461, 'recall': 0.3163265306122449, 'f1-score': 0.41333333333333333, 'support': 98}, 'accuracy': 0.9692844677137871, 'macro avg': {'precision': 0.7861679291202932, 'recall': 0.6543685417788365, 'f1-score': 0.6987813620071684, 'support': 2865}, 'weighted avg': {'precision': 0.9631827938454056, 'recall': 0.9692844677137871, 'f1-score': 0.9647013580038407, 'support': 2865}}
[0.03246073 0.02984293 0.03403141 0.03352541 0.03404924]
Avg RMSE score over 5 folds:
0.18100391004069125
Stddev RMSE score over 5 folds:
0.004419223815831302
logistic_class_rep=pd.DataFrame(logistic_model_result_dictionary['ClassificationRep'])
display(logistic_class_rep)
print("The first 10 results predicted on the test set:")
display(logistic_model_result_dictionary['Actual vs Prediction'].head(10))
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.976182 | 0.596154 | 0.969284 | 0.786168 | 0.963183 |
| recall | 0.992411 | 0.316327 | 0.969284 | 0.654369 | 0.969284 |
| f1-score | 0.984229 | 0.413333 | 0.969284 | 0.698781 | 0.964701 |
| support | 2767.000000 | 98.000000 | 0.969284 | 2865.000000 | 2865.000000 |
The first 10 results predicted on the test set:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 6757 | 0 | 0 | 0 | 0 |
| 5603 | 0 | 0 | 0 | 0 |
| 2081 | 0 | 0 | 0 | 0 |
| 476 | 0 | 0 | 0 | 0 |
| 6203 | 0 | 0 | 0 | 0 |
| 1671 | 0 | 0 | 0 | 0 |
| 809 | 0 | 0 | 0 | 0 |
| 9885 | 0 | 0 | 0 | 0 |
| 956 | 0 | 0 | 0 | 0 |
| 6200 | 0 | 0 | 0 | 0 |
We explicitly look at the test set's actual versus predicted deaths to see the true-positive counts for both classes.
print("We got {:.2f}% correct".format(100*len(
logistic_model_result_dictionary['Actual vs Prediction'][logistic_model_result_dictionary['Actual vs Prediction']['Diff']==0])/len(
logistic_model_result_dictionary['Actual vs Prediction'])))
print("We correctly predicted {} out of {} true deaths.".format(len(logistic_model_result_dictionary['Actual vs Prediction'][(logistic_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(logistic_model_result_dictionary['Actual vs Prediction']['Actual']==1)]),len(logistic_model_result_dictionary['Actual vs Prediction'][(logistic_model_result_dictionary['Actual vs Prediction']['Actual']==1)])))
print("We correctly predicted {} out of {} true lives.".format(len(logistic_model_result_dictionary['Actual vs Prediction'][(logistic_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(logistic_model_result_dictionary['Actual vs Prediction']['Actual']==0)]),len(logistic_model_result_dictionary['Actual vs Prediction'][(logistic_model_result_dictionary['Actual vs Prediction']['Actual']==0)])))
We got 96.93% correct We correctly predicted 31 out of 98 true deaths. We correctly predicted 2746 out of 2767 true lives.
- (4.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.
This is done.
- (4.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard to the working of this model.
As with the other models, the generating function has plotted the feature importances. The key features the model uses to weigh the results are ICU admission, being in the 80+ age group, hospitalisation, being in the 70-79 age group, and having pre-existing medical conditions (being over 80 or hospitalised carries almost three times the weight of the next most important feature).
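A sketch of how such a ranking is obtained from a fitted forest, assuming scikit-learn's `RandomForestClassifier`; the data here is synthetic, with column names echoing the real dummy features:

```python
# Sketch: pull and sort feature_importances_ from a fitted random forest.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, (200, 3)),
                 columns=["icu_yn_Yes", "age_group_80+ Years", "hosp_yn_Yes"])
y = (X["icu_yn_Yes"] | X["age_group_80+ Years"]).to_numpy()  # toy target

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = (pd.Series(rf.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances)  # importances sum to 1; higher = more used in splits
```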
- (4.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
This is done.
Based on the training data results, the macro-average F1 score is 0.72 with a high accuracy of 97%. This is the strongest-performing model examined so far over my sample data, though its performance on the training set is comparable to that of the logistic regression model. As with the other models, the biggest challenge is accurately classifying deaths, but this model has done better at that than all previous models on the training set.
- (4.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.
In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, our 'everybody lives' baseline, and our logistic regression model, achieving a macro F1-score of approximately 0.71 on the test data and an out-of-the-box accuracy of 97%, with a 5-fold cross-validated accuracy of 98.3%. Performance on the training data is only marginally different from the test data, suggesting the model is well-generalised and not overfit. The RMSE is lower than the logistic model's and the death accuracy is slightly higher. Based on these results, I would recommend the random forest model for use in a production setting.
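Beyond the hold-out and cross-validation checks, random forests also provide a built-in out-of-bag (OOB) estimate, scoring each tree on the bootstrap rows it never saw; a sketch on synthetic data, assuming scikit-learn:

```python
# Sketch: enable oob_score to get a generalisation estimate "for free",
# without touching the held-out test set. Toy deterministic target.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, (300, 4))
y = X[:, 0] & X[:, 1]  # toy target: depends only on the first two columns

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=1).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```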
rf_model_result_dictionary = create_model(
    fulldf=modelling_dummy_df,
    train_df=train_data,
    test_df=test_data,
    target_column=target_column,
    plot_comp=True,
    threshhold_class=0.5,
    assess=True,
    verbose=True,
    mod_type='Random Forest')
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Test Vs Predicted Result for Test Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 0 | 0 | 1 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.030824480023941343
MSE: 0.030824480023941343
RMSE: 0.17556901783612433
R2: 0.08771518276867951
----DETAIL----
Accuracy:
0.9691755199760587
Confusion matrix:
[[6393 56]
[ 150 84]]
Classification report:
{'0': {'precision': 0.9770747363594682, 'recall': 0.9913164831756861, 'f1-score': 0.9841440886699507, 'support': 6449}, '1': {'precision': 0.6, 'recall': 0.358974358974359, 'f1-score': 0.44919786096256686, 'support': 234}, 'accuracy': 0.9691755199760587, 'macro avg': {'precision': 0.7885373681797341, 'recall': 0.6751454210750225, 'f1-score': 0.7166709748162587, 'support': 6683}, 'weighted avg': {'precision': 0.9638717604043409, 'recall': 0.9691755199760587, 'f1-score': 0.9654133663471126, 'support': 6683}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Random Forest: The Original Vs Predicted Result Is:
Random Forest: Feature Importance:
----REPORT----
MAE: 0.030715532286212915
MSE: 0.030715532286212915
RMSE: 0.17525847279436424
R2: 0.07023741914546833
----DETAIL----
Accuracy:
0.9692844677137871
Confusion matrix:
[[2744 23]
[ 65 33]]
Classification report:
{'0': {'precision': 0.9768600925596298, 'recall': 0.9916877484640405, 'f1-score': 0.9842180774748924, 'support': 2767}, '1': {'precision': 0.5892857142857143, 'recall': 0.336734693877551, 'f1-score': 0.4285714285714286, 'support': 98}, 'accuracy': 0.9692844677137871, 'macro avg': {'precision': 0.783072903422672, 'recall': 0.6642112211707958, 'f1-score': 0.7063947530231605, 'support': 2865}, 'weighted avg': {'precision': 0.963602749079405, 'recall': 0.9692844677137871, 'f1-score': 0.9652116650516676, 'support': 2865}}
[0.96806283 0.96753927 0.96910995 0.96490309 0.96280775]
Avg Accuracy score over 5 folds:
0.9830987765527361
Stddev Accuracy score over 5 folds:
0.0011715767902164363
rf_class_rep=pd.DataFrame(rf_model_result_dictionary['ClassificationRep'])
display(rf_class_rep)
print("The first 10 results predicted on the test set:")
display(rf_model_result_dictionary['Actual vs Prediction'].head(10))
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.976860 | 0.589286 | 0.969284 | 0.783073 | 0.963603 |
| recall | 0.991688 | 0.336735 | 0.969284 | 0.664211 | 0.969284 |
| f1-score | 0.984218 | 0.428571 | 0.969284 | 0.706395 | 0.965212 |
| support | 2767.000000 | 98.000000 | 0.969284 | 2865.000000 | 2865.000000 |
The first 10 results predicted on the test set:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 6757 | 0 | 0 | 0 | 0 |
| 5603 | 0 | 0 | 0 | 0 |
| 2081 | 0 | 0 | 0 | 0 |
| 476 | 0 | 0 | 0 | 0 |
| 6203 | 0 | 0 | 0 | 0 |
| 1671 | 0 | 0 | 0 | 0 |
| 809 | 0 | 0 | 0 | 0 |
| 9885 | 0 | 0 | 0 | 0 |
| 956 | 0 | 0 | 0 | 0 |
| 6200 | 0 | 0 | 0 | 0 |
print("We got {:.2f}% correct".format(100*len(
rf_model_result_dictionary['Actual vs Prediction'][rf_model_result_dictionary['Actual vs Prediction']['Diff']==0])/len(
rf_model_result_dictionary['Actual vs Prediction'])))
print("We correctly predicted {} out of {} deaths.".format(
len(rf_model_result_dictionary['Actual vs Prediction'][(rf_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(rf_model_result_dictionary['Actual vs Prediction']['Actual']==1)]),len(rf_model_result_dictionary['Actual vs Prediction'][(rf_model_result_dictionary['Actual vs Prediction']['Actual']==1)])))
print("We correctly predicted {} out of {} lives.".format(
len(rf_model_result_dictionary['Actual vs Prediction'][(rf_model_result_dictionary['Actual vs Prediction']['Diff']==0)&(rf_model_result_dictionary['Actual vs Prediction']['Actual']==0)]),len(rf_model_result_dictionary['Actual vs Prediction'][(rf_model_result_dictionary['Actual vs Prediction']['Actual']==0)])))
display()
We got 96.93% correct We correctly predicted 33 out of 98 deaths. We correctly predicted 2744 out of 2767 lives.
XGBoost is an extreme gradient boosting library, evolved from earlier boosting methods such as AdaBoost, and is heavily used in industry (and in winning Kaggle entries) due to its fast training, powerful models, and wide range of tunable parameters.
I start by building an XGBoost model with hyperparameters tuned via grid search, to assess what an alternative model may look like and how it compares to the models built above.
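A sketch of that tuning step; to keep the example self-contained, scikit-learn's `GradientBoostingClassifier` stands in here for xgboost's `XGBClassifier`, and the grid values are illustrative rather than the tuned ones:

```python
# Sketch: exhaustive grid search over a small boosting hyperparameter grid,
# scored by macro F1 (the metric emphasised throughout this homework).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = rng.random((120, 4))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy target

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_grid={"n_estimators": [50, 100],
                "max_depth": [2, 3],
                "learning_rate": [0.05, 0.1]},
    scoring="f1_macro", cv=3)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best macro-F1:", grid.best_score_)
```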
# XGBoost has an incredibly annoying warning system. Two warnings get repeatedly
# generated with XGBClassifier: first a warning about a change in how the binary
# logistic scoring works, and second a UserWarning re: use_label_encoder.
# Suppress these:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
xgboost_model_result_dictionary = create_model(
    fulldf=modelling_dummy_df,
    train_df=train_data,
    test_df=test_data,
    target_column=target_column,
    plot_comp=True,
    threshhold_class=0.5,
    assess=True,
    verbose=True,
    mod_type='XGBoost')
Model Training Score: 96.88762531797097%
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Test Vs Predicted Result for Test Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 0 | 0 | 1 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.031123746820290288
MSE: 0.031123746820290288
RMSE: 0.17641923597014667
R2: 0.07885804862080259
----DETAIL----
Accuracy:
0.9688762531797097
Confusion matrix:
[[6393 56]
[ 152 82]]
Classification report:
{'0': {'precision': 0.9767761650114591, 'recall': 0.9913164831756861, 'f1-score': 0.9839926119747576, 'support': 6449}, '1': {'precision': 0.5942028985507246, 'recall': 0.3504273504273504, 'f1-score': 0.4408602150537634, 'support': 234}, 'accuracy': 0.9688762531797097, 'macro avg': {'precision': 0.7854895317810919, 'recall': 0.6708719168015183, 'f1-score': 0.7124264135142605, 'support': 6683}, 'weighted avg': {'precision': 0.9633806623402319, 'recall': 0.9688762531797097, 'f1-score': 0.9649752573616328, 'support': 6683}}
---------------
---------------
Model Accuracy: [97.27748691 96.64921466 96.70157068 96.49030906 96.33315872]
The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.030366492146596858
MSE: 0.030366492146596858
RMSE: 0.17425984088882
R2: 0.08080290301881532
----DETAIL----
Accuracy:
0.9696335078534032
Confusion matrix:
[[2745 22]
[ 65 33]]
Classification report:
{'0': {'precision': 0.9768683274021353, 'recall': 0.9920491507047343, 'f1-score': 0.984400215169446, 'support': 2767}, '1': {'precision': 0.6, 'recall': 0.336734693877551, 'f1-score': 0.43137254901960786, 'support': 98}, 'accuracy': 0.9696335078534032, 'macro avg': {'precision': 0.7884341637010677, 'recall': 0.6643919222911426, 'f1-score': 0.7078863820945269, 'support': 2865}, 'weighted avg': {'precision': 0.9639771943880308, 'recall': 0.9696335078534032, 'f1-score': 0.9654833874966069, 'support': 2865}}
Importance by Booster Plot
Importance by Weight:
The Tree Is:
xgb_class_rep=pd.DataFrame(xgboost_model_result_dictionary['ClassificationRep'])
display(xgb_class_rep)
print("The first 10 results predicted on the test set:")
display(xgboost_model_result_dictionary['Actual vs Prediction'].head(10))
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.976868 | 0.600000 | 0.969634 | 0.788434 | 0.963977 |
| recall | 0.992049 | 0.336735 | 0.969634 | 0.664392 | 0.969634 |
| f1-score | 0.984400 | 0.431373 | 0.969634 | 0.707886 | 0.965483 |
| support | 2767.000000 | 98.000000 | 0.969634 | 2865.000000 | 2865.000000 |
The first 10 results predicted on the test set:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 6757 | 0 | 0 | 0 | 0 |
| 5603 | 0 | 0 | 0 | 0 |
| 2081 | 0 | 0 | 0 | 0 |
| 476 | 0 | 0 | 0 | 0 |
| 6203 | 0 | 0 | 0 | 0 |
| 1671 | 0 | 0 | 0 | 0 |
| 809 | 0 | 0 | 0 | 0 |
| 9885 | 0 | 0 | 0 | 0 |
| 956 | 0 | 0 | 0 | 0 |
| 6200 | 0 | 0 | 0 | 0 |
- (5.1) Which model of the ones trained above performs better at predicting the target feature?
- Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.
In the cells below I have created comparisons of each of the models which were built; I have also compared these models against one another while describing their performance individually in the previous sections. Every trained model performs better than the simple "everybody lives" model. While the simple model may in some instances have a higher accuracy, this is driven by the majority class being that patients live, and by construction it never predicts death. In the context of this assignment that is an incredibly poor property, as it means that patients who are at risk of dying from COVID would not be appropriately flagged and hence could receive a poor case outcome, because doctors are not aware of the inner workings of the model.
Among the mandatory models, the one with the best balance of true positives, false positives, and the remaining metrics is the Random Forest model, which slightly edges out the Logistic Regression model. Overall, the XGBoost model, a non-mandatory addition to the assignment, is the best performing model, although its performance is comparable to the Random Forest (this is not too surprising, as XGBoost defaults to gradient-boosted trees when building the model). The one downside of the XGBoost model is that its training time is considerably longer than the other models', due to the GridSearch used to pinpoint the optimal hyperparameters.
We see in the graphs below that, over the test dataset, the performance of Logistic Regression, Random Forest, and XGBoost is quite comparable, with relatively minor differences between the three. I would recommend comparing against AutoML as an additional model, as it is highly performant, incorporates ensembling and hyperparameter optimisation, and Google's offering is well regarded as an easily implementable out-of-the-box ML kit, but this is out of scope for the assignment.
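The majority-class baseline discussed above can also be reproduced without hand-rolling it; a minimal sketch using scikit-learn's DummyClassifier on synthetic stand-in data (the real comparison uses the train/test split from part 1):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the imbalanced COVID data: ~5% positive ('death') class.
rng = np.random.default_rng(0)
X_train = rng.random((100, 3)); y_train = (rng.random(100) < 0.05).astype(int)
X_test = rng.random((40, 3));   y_test = (rng.random(40) < 0.05).astype(int)

# strategy='most_frequent' always predicts the majority class (0 / 'no' here),
# regardless of the features.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test), zero_division=0))
```

With `strategy='most_frequent'` the baseline's recall for the minority 'death' class is 0 by construction, which is exactly the weakness described above.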
result_df=compare_models(model_report_list=[alllive_class_rep.T,linreg_class_rep.T,logistic_class_rep.T,rf_class_rep.T,xgb_class_rep.T]
, keys=["Simple", "Linear", "Logistic","RandomForest","XGBoost"])
Comparison Dataframe is:
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 0 | Simple | 0 | 0.965794 | 1.000000 | 0.982599 | 2767.000000 |
| 1 | Simple | 1 | 0.000000 | 0.000000 | 0.000000 | 98.000000 |
| 2 | Simple | accuracy | 0.965794 | 0.965794 | 0.965794 | 0.965794 |
| 3 | Simple | macro avg | 0.482897 | 0.500000 | 0.491300 | 2865.000000 |
| 4 | Simple | weighted avg | 0.932758 | 0.965794 | 0.948989 | 2865.000000 |
| 5 | Linear | 0 | 0.969782 | 0.997470 | 0.983431 | 2767.000000 |
| 6 | Linear | 1 | 0.631579 | 0.122449 | 0.205128 | 98.000000 |
| 7 | Linear | accuracy | 0.967539 | 0.967539 | 0.967539 | 0.967539 |
| 8 | Linear | macro avg | 0.800681 | 0.559960 | 0.594280 | 2865.000000 |
| 9 | Linear | weighted avg | 0.958214 | 0.967539 | 0.956809 | 2865.000000 |
| 10 | Logistic | 0 | 0.976182 | 0.992411 | 0.984229 | 2767.000000 |
| 11 | Logistic | 1 | 0.596154 | 0.316327 | 0.413333 | 98.000000 |
| 12 | Logistic | accuracy | 0.969284 | 0.969284 | 0.969284 | 0.969284 |
| 13 | Logistic | macro avg | 0.786168 | 0.654369 | 0.698781 | 2865.000000 |
| 14 | Logistic | weighted avg | 0.963183 | 0.969284 | 0.964701 | 2865.000000 |
| 15 | RandomForest | 0 | 0.976860 | 0.991688 | 0.984218 | 2767.000000 |
| 16 | RandomForest | 1 | 0.589286 | 0.336735 | 0.428571 | 98.000000 |
| 17 | RandomForest | accuracy | 0.969284 | 0.969284 | 0.969284 | 0.969284 |
| 18 | RandomForest | macro avg | 0.783073 | 0.664211 | 0.706395 | 2865.000000 |
| 19 | RandomForest | weighted avg | 0.963603 | 0.969284 | 0.965212 | 2865.000000 |
| 20 | XGBoost | 0 | 0.976868 | 0.992049 | 0.984400 | 2767.000000 |
| 21 | XGBoost | 1 | 0.600000 | 0.336735 | 0.431373 | 98.000000 |
| 22 | XGBoost | accuracy | 0.969634 | 0.969634 | 0.969634 | 0.969634 |
| 23 | XGBoost | macro avg | 0.788434 | 0.664392 | 0.707886 | 2865.000000 |
| 24 | XGBoost | weighted avg | 0.963977 | 0.969634 | 0.965483 | 2865.000000 |
display(result_df[result_df['Type']=='0'])
display(result_df[result_df['Type']=='1'])
display(result_df[result_df['Type']=='accuracy'])
display(result_df[result_df['Type']=='weighted avg'])
display(result_df[result_df['Type']=='macro avg'])
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 0 | Simple | 0 | 0.965794 | 1.000000 | 0.982599 | 2767.0 |
| 5 | Linear | 0 | 0.969782 | 0.997470 | 0.983431 | 2767.0 |
| 10 | Logistic | 0 | 0.976182 | 0.992411 | 0.984229 | 2767.0 |
| 15 | RandomForest | 0 | 0.976860 | 0.991688 | 0.984218 | 2767.0 |
| 20 | XGBoost | 0 | 0.976868 | 0.992049 | 0.984400 | 2767.0 |
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 1 | Simple | 1 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| 6 | Linear | 1 | 0.631579 | 0.122449 | 0.205128 | 98.0 |
| 11 | Logistic | 1 | 0.596154 | 0.316327 | 0.413333 | 98.0 |
| 16 | RandomForest | 1 | 0.589286 | 0.336735 | 0.428571 | 98.0 |
| 21 | XGBoost | 1 | 0.600000 | 0.336735 | 0.431373 | 98.0 |
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 2 | Simple | accuracy | 0.965794 | 0.965794 | 0.965794 | 0.965794 |
| 7 | Linear | accuracy | 0.967539 | 0.967539 | 0.967539 | 0.967539 |
| 12 | Logistic | accuracy | 0.969284 | 0.969284 | 0.969284 | 0.969284 |
| 17 | RandomForest | accuracy | 0.969284 | 0.969284 | 0.969284 | 0.969284 |
| 22 | XGBoost | accuracy | 0.969634 | 0.969634 | 0.969634 | 0.969634 |
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 4 | Simple | weighted avg | 0.932758 | 0.965794 | 0.948989 | 2865.0 |
| 9 | Linear | weighted avg | 0.958214 | 0.967539 | 0.956809 | 2865.0 |
| 14 | Logistic | weighted avg | 0.963183 | 0.969284 | 0.964701 | 2865.0 |
| 19 | RandomForest | weighted avg | 0.963603 | 0.969284 | 0.965212 | 2865.0 |
| 24 | XGBoost | weighted avg | 0.963977 | 0.969634 | 0.965483 | 2865.0 |
| Model | Type | precision | recall | f1-score | support | |
|---|---|---|---|---|---|---|
| 3 | Simple | macro avg | 0.482897 | 0.500000 | 0.491300 | 2865.0 |
| 8 | Linear | macro avg | 0.800681 | 0.559960 | 0.594280 | 2865.0 |
| 13 | Logistic | macro avg | 0.786168 | 0.654369 | 0.698781 | 2865.0 |
| 18 | RandomForest | macro avg | 0.783073 | 0.664211 | 0.706395 | 2865.0 |
| 23 | XGBoost | macro avg | 0.788434 | 0.664392 | 0.707886 | 2865.0 |
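The per-class comparisons above can also be visualised directly; a sketch that rebuilds the class-1 rows from the reported numbers (so it runs standalone) and plots them as a grouped bar chart:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Class-1 ('death') metrics, copied from the comparison table above.
class1 = pd.DataFrame({
    'Model': ['Simple', 'Linear', 'Logistic', 'RandomForest', 'XGBoost'],
    'precision': [0.000000, 0.631579, 0.596154, 0.589286, 0.600000],
    'recall':    [0.000000, 0.122449, 0.316327, 0.336735, 0.336735],
    'f1-score':  [0.000000, 0.205128, 0.413333, 0.428571, 0.431373],
}).set_index('Model')

ax = class1.plot(kind='bar', figsize=(10, 4), rot=0,
                 title="Class 1 ('death') metrics per model")
ax.set_ylabel('score')
plt.tight_layout()
plt.savefig('./class1_model_comparison.png')
```

This makes the headline result visible at a glance: the tree-based models and Logistic Regression cluster together on the minority class, while Linear and Simple fall well behind.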
This has partly been addressed in the previous section and is outlined in the problem scope section of the introduction.
We are trying to predict whether a patient is likely to have a good (living) or bad (dying) prognosis based on a combination of demographic details and their patient history by creating ML models.
There are two key challenges with this:
To do this, we developed five models (Simple, Linear Regression, Logistic Regression, Random Forest, and XGBoost) and compared their performance over both the training set and the test set, analysing the results for each model with particular attention to the number of deaths correctly predicted and the overall precision.
Based on the features ICU admission, age group, medical condition history, and hospitalisation status, we see that the Random Forest is the best performing of the base models, while XGBoost (an additional model, not mandatory for this assignment) is the strongest overall. The performance of Logistic Regression, Random Forest, and XGBoost was very comparable, with minor differences between the three; Linear Regression performed poorly, and the Simple model, although it had a high accuracy, was totally inappropriate for the context of the problem.
Yes, we could create a gender-specific model. We see that the gender distributions of death differ, so we could train one model for male patients and one for female or unknown patients, then call whichever model matches the patient's gender. Alternatively, we could add additional features or introduce an age split. We could also use XGBoost, as I have already done.
def Male_Female_RF_Model(df):
def create_gender_RandomForest_model(fulldf,train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True,gender=None):
"""Create a random forest model
Note: The variable names say linear but that's only because it was first built for a linear model. Observe it does initialise the correct model type."""
#Full Set
X=fulldf.drop([target_column], axis=1)
y=fulldf[target_column]
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Create the Random Forest classifier
lin_regression_model = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=14395076)
#Fit the data
lin_regression_model.fit(X_train,y_train)
#--End Coefficient Check--##
###-BEGIN TESTING ON TRAIN DATA - DO NOT USE THIS, INVALID BUT REQUIRED IN Q!!!!
print("---------------")
print("---------------")
print("As part of 3 we are meant to predict the data used to train the model (???)")
inv_lin_prediction=lin_regression_model.predict(X_train)
inv_linear_prediction_classified=np.where(inv_lin_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result for the Training Set Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_train))
plt.plot(x_axis, y_train, label="Original-Train")
#Add second plot for prediction class
plt.plot(x_axis, inv_linear_prediction_classified, label="Predicted-Train")
plt.title("COVID test and predicted data")
plt.legend()
#plt.savefig('./lin_pred_vs_orig_covid.png')
plt.show()
#Results
inv_pred_vs_act_df=pd.DataFrame({'Actual':y_train
,'Predicted':inv_lin_prediction
,'PredictionClass':inv_linear_prediction_classified})
inv_pred_vs_act_df['Diff']=inv_pred_vs_act_df['Actual']-inv_pred_vs_act_df['PredictionClass']
print("As required the First Ten Results predicting for the training data:")
display(inv_pred_vs_act_df.head(10))
#Metrics
inv_model_metric=model_metrics(testActualVal=y_train, predictions=inv_linear_prediction_classified, verbose=True)
print("---------------")
print("---------------")
##---END OF CHECKING ON TRAIN DATA
print("---------------")
print("---------------")
print(" PROPER RESULT FROM TEST DATA:")
#Check the predictions
lin_prediction = lin_regression_model.predict(X_test)
#Classify them - Note: the classifier's predict already returns class labels, so this threshold has no effect here
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("Random Forest {}: The Original Vs Predicted Result Is:".format(gender))
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('./randomforest_{}_pred_vs_orig_covid.png'.format(gender))
plt.show()
#Save the Model
filename = './randomforest_{}_model_covid.pickle'.format(gender)
pickle.dump(lin_regression_model, open(filename, 'wb'))
feature_imp=pd.DataFrame({'feature': X_train.columns, 'importance': lin_regression_model.feature_importances_})
feature_imp=feature_imp.set_index('feature')
feature_imp=feature_imp.sort_values('importance', axis=0, ascending=False)
if plot_comp:
print("Random Forest: Feature Importance:")
#Plot from DF (figsize passed directly; a bare plt.figure() beforehand would leave an unused empty figure)
feature_imp.plot(kind='barh', figsize=(50,20))
plt.title("Random Forest Feature Importance")
plt.legend()
plt.savefig('./randomforest_{}_importance_covid.png'.format(gender))
plt.show()
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), X, y, scoring='accuracy', cv=5)
print(scores)
cv_rmse = scores  # accuracy scores; no square root needed for accuracy (unlike RMSE)
print("Avg Accuracy score over 5 folds: \n", np.mean(cv_rmse))
print("Stddev Accuracy score over 5 folds: \n", np.std(cv_rmse))
result_dict={}
result_dict['Gender']=gender
result_dict['Model']=lin_regression_model
result_dict['Model_Coefficients']=None#zip(X_train.columns,lin_regression_model.coef_)
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_Acc_Mean']=np.mean(cv_rmse)
result_dict['CrossVal_Acc_Std']=np.std(cv_rmse)
result_dict['FeatureImportance']=feature_imp
return result_dict
male_df=df[df['sex']=='Male']
not_male_df=df[df['sex']!='Male']
male_df=male_df.drop('sex',axis=1)
not_male_df=not_male_df.drop('sex',axis=1)
male_df=pd.get_dummies(male_df, columns=["age_group"
,"hosp_yn"
,"icu_yn"
,"medcond_yn"], drop_first=True)
not_male_df=pd.get_dummies(not_male_df, columns=["age_group"
,"hosp_yn"
,"icu_yn"
,"medcond_yn"], drop_first=True)
male_test_train_split=get_randomised_data(df=male_df,test_size=0.3)
not_male_test_train_split=get_randomised_data(df=not_male_df,test_size=0.3)
male_train_data=male_test_train_split['Train']
male_test_data=male_test_train_split['Test']
not_male_train_data=not_male_test_train_split['Train']
not_male_test_data=not_male_test_train_split['Test']
gender_dict={}
male_result=create_gender_RandomForest_model(fulldf=male_df
,train_df=male_train_data
,test_df=male_test_data
,target_column='death_yn'
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True,gender='Male')
not_male_result=create_gender_RandomForest_model(fulldf=not_male_df
,train_df=not_male_train_data
,test_df=not_male_test_data
,target_column='death_yn'
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True,gender='Female')
gender_dict['Male']=male_result
gender_dict['Female']=not_male_result
return gender_dict
new_predictive_features=["age_group"
,'sex'
,"hosp_yn"
,"icu_yn"
,"medcond_yn"]
new_keep_features=[target_column]
for column in new_predictive_features:
new_keep_features+=[column]
new_modelling_df=raw_df[new_keep_features].copy() #.copy() avoids SettingWithCopyWarning on the assignments below
#Transform target feature
new_modelling_df['death_yn']=new_modelling_df['death_yn'].astype(str)
new_modelling_df.loc[(new_modelling_df['death_yn']=='Yes'),'death_yn']=1
new_modelling_df.loc[(new_modelling_df['death_yn']=='No'),'death_yn']=0
new_modelling_df['death_yn']=new_modelling_df['death_yn'].astype(int)
Male_Female_RF_Model(df=new_modelling_df)
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Vs Predicted Result for the Training Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 9469 | 0 | 0 | 0 | 0 |
| 376 | 1 | 1 | 1 | 0 |
| 9241 | 0 | 0 | 0 | 0 |
| 34 | 0 | 0 | 0 | 0 |
| 5749 | 0 | 0 | 0 | 0 |
| 1790 | 0 | 0 | 0 | 0 |
| 1208 | 0 | 0 | 0 | 0 |
| 3097 | 0 | 0 | 0 | 0 |
| 2316 | 0 | 0 | 0 | 0 |
| 8571 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.0287300032647731
MSE: 0.0287300032647731
RMSE: 0.1694992721659096
R2: 0.22435613363644225
----DETAIL----
Accuracy:
0.9712699967352268
Confusion matrix:
[[2921 24]
[ 64 54]]
Classification report:
{'0': {'precision': 0.9785594639865997, 'recall': 0.9918505942275042, 'f1-score': 0.9851602023608769, 'support': 2945}, '1': {'precision': 0.6923076923076923, 'recall': 0.4576271186440678, 'f1-score': 0.5510204081632654, 'support': 118}, 'accuracy': 0.9712699967352268, 'macro avg': {'precision': 0.835433578147146, 'recall': 0.724738856435786, 'f1-score': 0.7680903052620711, 'support': 3063}, 'weighted avg': {'precision': 0.9675318084011896, 'recall': 0.9712699967352268, 'f1-score': 0.9684352608932575, 'support': 3063}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Random Forest Male: The Original Vs Predicted Result Is:
Random Forest: Feature Importance:
----REPORT----
MAE: 0.031226199543031227
MSE: 0.031226199543031227
RMSE: 0.17670936461611542
R2: 0.05655450403084439
----DETAIL----
Accuracy:
0.9687738004569688
Confusion matrix:
[[1258 10]
[ 31 14]]
Classification report:
{'0': {'precision': 0.9759503491078355, 'recall': 0.9921135646687698, 'f1-score': 0.9839655846695345, 'support': 1268}, '1': {'precision': 0.5833333333333334, 'recall': 0.3111111111111111, 'f1-score': 0.4057971014492754, 'support': 45}, 'accuracy': 0.9687738004569688, 'macro avg': {'precision': 0.7796418412205844, 'recall': 0.6516123378899404, 'f1-score': 0.694881343059405, 'support': 1313}, 'weighted avg': {'precision': 0.9624943203874604, 'recall': 0.9687738004569688, 'f1-score': 0.9641502139574922, 'support': 1313}}
[0.96575342 0.97142857 0.96685714 0.96342857 0.96685714]
Avg Accuracy score over 5 folds:
0.9832920321400753
Stddev Accuracy score over 5 folds:
0.0013229420371997278
---------------
---------------
As part of 3 we are meant to predict the data used to train the model (???)
The Original Vs Predicted Result for the Training Set Is:
As required the First Ten Results predicting for the training data:
| Actual | Predicted | PredictionClass | Diff | |
|---|---|---|---|---|
| 5650 | 0 | 0 | 0 | 0 |
| 3056 | 0 | 0 | 0 | 0 |
| 2095 | 0 | 0 | 0 | 0 |
| 3698 | 0 | 0 | 0 | 0 |
| 4738 | 0 | 0 | 0 | 0 |
| 521 | 0 | 0 | 0 | 0 |
| 8261 | 0 | 0 | 0 | 0 |
| 2654 | 0 | 0 | 0 | 0 |
| 1713 | 0 | 0 | 0 | 0 |
| 9926 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.032044198895027624
MSE: 0.032044198895027624
RMSE: 0.17900893523795852
R2: 0.0675468924521303
----DETAIL----
Accuracy:
0.9679558011049724
Confusion matrix:
[[3468 23]
[ 93 36]]
Classification report:
{'0': {'precision': 0.9738837405223252, 'recall': 0.9934116299054712, 'f1-score': 0.9835507657402155, 'support': 3491}, '1': {'precision': 0.6101694915254238, 'recall': 0.27906976744186046, 'f1-score': 0.3829787234042553, 'support': 129}, 'accuracy': 0.9679558011049724, 'macro avg': {'precision': 0.7920266160238745, 'recall': 0.6362406986736658, 'f1-score': 0.6832647445722354, 'support': 3620}, 'weighted avg': {'precision': 0.9609226526437064, 'recall': 0.9679558011049724, 'f1-score': 0.9621491653365307, 'support': 3620}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Random Forest Female: The Original Vs Predicted Result Is:
Random Forest: Feature Importance:
----REPORT----
MAE: 0.03157216494845361
MSE: 0.03157216494845361
RMSE: 0.17768557889838332
R2: -0.2574074074074073
----DETAIL----
Accuracy:
0.9684278350515464
Confusion matrix:
[[1495 17]
[ 32 8]]
Classification report:
{'0': {'precision': 0.9790438768827767, 'recall': 0.9887566137566137, 'f1-score': 0.9838762750904904, 'support': 1512}, '1': {'precision': 0.32, 'recall': 0.2, 'f1-score': 0.24615384615384614, 'support': 40}, 'accuracy': 0.9684278350515464, 'macro avg': {'precision': 0.6495219384413884, 'recall': 0.5943783068783068, 'f1-score': 0.6150150606221683, 'support': 1552}, 'weighted avg': {'precision': 0.9620582099528082, 'recall': 0.9684278350515464, 'f1-score': 0.9648628104271747, 'support': 1552}}
[0.96328502 0.96328502 0.96711799 0.95841393 0.95841393]
Avg Accuracy score over 5 folds:
0.9808671224562737
Stddev Accuracy score over 5 folds:
0.0016929337597882945
{'Male': {'Gender': 'Male',
'Model': RandomForestClassifier(oob_score=True, random_state=14395076),
'Model_Coefficients': None,
'Actual vs Prediction': Actual Predicted PredictionClass Diff
6256 0 1 1 -1
2563 0 0 0 0
7238 0 0 0 0
4052 0 0 0 0
4499 0 0 0 0
... ... ... ... ...
8819 0 0 0 0
9830 0 0 0 0
8845 0 0 0 0
3834 0 0 0 0
1232 0 0 0 0
[1313 rows x 4 columns],
'RMSE': 0.17670936461611542,
'MSE': 0.031226199543031227,
'MAE': 0.031226199543031227,
'Accuracy': 0.9687738004569688,
'Confusion': array([[1258, 10],
[ 31, 14]]),
'ClassificationRep': {'0': {'precision': 0.9759503491078355,
'recall': 0.9921135646687698,
'f1-score': 0.9839655846695345,
'support': 1268},
'1': {'precision': 0.5833333333333334,
'recall': 0.3111111111111111,
'f1-score': 0.4057971014492754,
'support': 45},
'accuracy': 0.9687738004569688,
'macro avg': {'precision': 0.7796418412205844,
'recall': 0.6516123378899404,
'f1-score': 0.694881343059405,
'support': 1313},
'weighted avg': {'precision': 0.9624943203874604,
'recall': 0.9687738004569688,
'f1-score': 0.9641502139574922,
'support': 1313}},
'CrossVal_Acc_Mean': 0.0013229420371997278,
'FeatureImportance': importance
feature
hosp_yn_Yes 0.305438
age_group_80+ Years 0.215937
icu_yn_Yes 0.129953
icu_yn_Unknown 0.063140
medcond_yn_Yes 0.058458
age_group_70 - 79 Years 0.042683
medcond_yn_Unknown 0.040627
age_group_60 - 69 Years 0.031442
hosp_yn_Unknown 0.025876
age_group_50 - 59 Years 0.025747
age_group_40 - 49 Years 0.025362
age_group_30 - 39 Years 0.018269
age_group_20 - 29 Years 0.009800
age_group_10 - 19 Years 0.007258
age_group_Unknown 0.000012
hosp_yn_OTH 0.000000},
'Female': {'Gender': 'Female',
'Model': RandomForestClassifier(oob_score=True, random_state=14395076),
'Model_Coefficients': None,
'Actual vs Prediction': Actual Predicted PredictionClass Diff
1814 0 0 0 0
188 0 0 0 0
8134 0 0 0 0
6007 0 0 0 0
9056 0 0 0 0
... ... ... ... ...
8973 0 0 0 0
1106 0 0 0 0
421 0 0 0 0
7129 0 0 0 0
6294 0 0 0 0
[1552 rows x 4 columns],
'RMSE': 0.17768557889838332,
'MSE': 0.03157216494845361,
'MAE': 0.03157216494845361,
'Accuracy': 0.9684278350515464,
'Confusion': array([[1495, 17],
[ 32, 8]]),
'ClassificationRep': {'0': {'precision': 0.9790438768827767,
'recall': 0.9887566137566137,
'f1-score': 0.9838762750904904,
'support': 1512},
'1': {'precision': 0.32,
'recall': 0.2,
'f1-score': 0.24615384615384614,
'support': 40},
'accuracy': 0.9684278350515464,
'macro avg': {'precision': 0.6495219384413884,
'recall': 0.5943783068783068,
'f1-score': 0.6150150606221683,
'support': 1552},
'weighted avg': {'precision': 0.9620582099528082,
'recall': 0.9684278350515464,
'f1-score': 0.9648628104271747,
'support': 1552}},
'CrossVal_Acc_Mean': 0.0016929337597882945,
'FeatureImportance': importance
feature
age_group_80+ Years 0.375111
hosp_yn_Yes 0.230273
icu_yn_Yes 0.087735
age_group_70 - 79 Years 0.054024
hosp_yn_Unknown 0.045977
icu_yn_Unknown 0.045225
medcond_yn_Yes 0.041613
age_group_60 - 69 Years 0.024035
medcond_yn_Unknown 0.021573
age_group_50 - 59 Years 0.016164
age_group_40 - 49 Years 0.014524
age_group_30 - 39 Years 0.014491
age_group_20 - 29 Years 0.014372
age_group_Unknown 0.011273
age_group_10 - 19 Years 0.003607
hosp_yn_OTH 0.000000}}
Although the results above are split by gender rather than combined, reviewing them I can clearly see that this model is less predictive than the original model. This is likely because splitting the data diluted it and resulted in poorer training of each model. As such, I no longer believe the split by age group which I had previously suggested would be useful.
We could try adding additional features rather than just 4. Although this is likely to cause dimensionality issues, we could see what happens.
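As a quick way to gauge the dimensionality cost before training, we can count the columns pd.get_dummies produces; a toy sketch (the frame and its category levels are illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for raw_df; column names mirror the ones used above.
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Unknown'],
    'race': ['White', 'Asian', 'Black', 'Unknown'],
    'age_group': ['80+ Years', '20 - 29 Years', '50 - 59 Years', '80+ Years'],
})
# With drop_first=True, each categorical with k observed levels contributes
# k-1 dummy columns, so column counts grow quickly with high-cardinality features.
dummies = pd.get_dummies(toy, columns=toy.columns.tolist(), drop_first=True)
print(toy.shape[1], '->', dummies.shape[1])  # 3 original columns -> 7 dummies
```

On the real data this is how 13 predictive features expand to the 55 columns shown below.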
new_predictive_features=["sex",
"age_group",
"hosp_yn",
"icu_yn",
"medcond_yn",
"race",
"days_until_onset",
"onset_present",
"cdc_case_earliest_weekday",
"cdc_case_earliest_month",
"cdc_case_earliest_year",
"demographic_missing",
"medical_missing"]
new_keep_features=[target_column]
for column in new_predictive_features:
new_keep_features+=[column]
new_modelling_df=raw_df[new_keep_features].copy() #.copy() avoids SettingWithCopyWarning on the assignments below
#Transform target feature
new_modelling_df['death_yn']=new_modelling_df['death_yn'].astype(str)
new_modelling_df.loc[(new_modelling_df['death_yn']=='Yes'),'death_yn']=1
new_modelling_df.loc[(new_modelling_df['death_yn']=='No'),'death_yn']=0
new_modelling_df['death_yn']=new_modelling_df['death_yn'].astype(int)
new_modelling_dummy_df=pd.get_dummies(new_modelling_df, columns=new_predictive_features, drop_first=True)
new_test_train_split=get_randomised_data(df=new_modelling_dummy_df,test_size=0.3)
new_train_data=new_test_train_split['Train']
new_test_data=new_test_train_split['Test']
new_modelling_dummy_df
| death_yn | sex_Male | sex_Unknown | age_group_10 - 19 Years | age_group_20 - 29 Years | age_group_30 - 39 Years | age_group_40 - 49 Years | age_group_50 - 59 Years | age_group_60 - 69 Years | age_group_70 - 79 Years | ... | cdc_case_earliest_month_6 | cdc_case_earliest_month_7 | cdc_case_earliest_month_8 | cdc_case_earliest_month_9 | cdc_case_earliest_month_10 | cdc_case_earliest_month_11 | cdc_case_earliest_month_12 | cdc_case_earliest_year_2021 | demographic_missing_True | medical_missing_True | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9994 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 9995 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9997 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 9998 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 9999 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
9548 rows × 55 columns
new_linear_model_result_dictionary=(
create_model(fulldf=new_modelling_dummy_df
,train_df=new_train_data
,test_df=new_test_data
,target_column=target_column
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True
,mod_type='Linear')
)
death_yn =
('sex_Male' * 0.006901505560004701)
+('sex_Unknown' * -0.004497066978747244)
+('age_group_10 - 19 Years' * 0.000607672523391653)
+('age_group_20 - 29 Years' * -0.004622936679568382)
+('age_group_30 - 39 Years' * -0.0036750975613204853)
+('age_group_40 - 49 Years' * -0.007793547339427581)
+('age_group_50 - 59 Years' * -0.0037815179359893215)
+('age_group_60 - 69 Years' * 0.02803302958048757)
+('age_group_70 - 79 Years' * 0.08858345013918513)
+('age_group_80+ Years' * 0.25293360382332664)
+('age_group_Unknown' * 0.10744646330387347)
+('hosp_yn_OTH' * -7.966717563423487e-16)
+('hosp_yn_Unknown' * 0.01486280088811685)
+('hosp_yn_Yes' * 0.16257323234806542)
+('icu_yn_Unknown' * 0.026367060305329246)
+('icu_yn_Yes' * 0.2727255294613255)
+('medcond_yn_Unknown' * 0.002444624109411736)
+('medcond_yn_Yes' * 0.018213048132006193)
+('race_Asian' * -0.022135440071506404)
+('race_Black' * -0.020386261084608268)
+('race_Hispanic/Latino' * -0.007630818410974988)
+('race_Multiple/Other' * -0.02855818625867826)
+('race_Native Hawaiian/Other Pacific Islander' * -0.011653393670996076)
+('race_Unknown' * -0.036629250990292364)
+('race_White' * -0.008784315493858512)
+('days_until_onset_1.0' * 0.048180193906087215)
+('days_until_onset_2.0' * 0.006855053355228846)
+('days_until_onset_3.0' * -0.018849265080594216)
+('days_until_onset_4.0' * -0.03502640929315003)
+('days_until_onset_5.0' * 0.024551486663280435)
+('days_until_onset_6.0' * -0.07345952518821125)
+('days_until_onset_7.0' * 0.019842326061002162)
+('days_until_onset_10.0' * 0.00280743089951635)
+('onset_present_True' * -0.007820289555383289)
+('cdc_case_earliest_weekday_1' * -0.008140196407170189)
+('cdc_case_earliest_weekday_2' * -0.0004785700364238199)
+('cdc_case_earliest_weekday_3' * -0.008543204222048649)
+('cdc_case_earliest_weekday_4' * -8.974612004899324e-05)
+('cdc_case_earliest_weekday_5' * 0.000972580973510058)
+('cdc_case_earliest_weekday_6' * 0.006175388213298605)
+('cdc_case_earliest_month_2' * 0.49831084408935616)
+('cdc_case_earliest_month_3' * 0.12011524784258149)
+('cdc_case_earliest_month_4' * 0.15307380942546506)
+('cdc_case_earliest_month_5' * 0.09747423969350665)
+('cdc_case_earliest_month_6' * 0.0946320127876043)
+('cdc_case_earliest_month_7' * 0.0830295077357964)
+('cdc_case_earliest_month_8' * 0.08411828537736243)
+('cdc_case_earliest_month_9' * 0.08035491132740767)
+('cdc_case_earliest_month_10' * 0.0772315781276825)
+('cdc_case_earliest_month_11' * 0.08100591606555278)
+('cdc_case_earliest_month_12' * 0.07903502996276025)
+('cdc_case_earliest_year_2021' * 0.07373477431993411)
+('demographic_missing_True' * 0.005047396274734002)
+('medical_missing_True' * -0.013558962166056066) + (-0.08155627092430544)
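To make the coefficient dump above concrete, the linear score can be computed by hand for a hypothetical high-risk case: an 80+ year old who was hospitalised, admitted to ICU, and has a medical condition (all other dummy features 0). This is a sketch, not part of the pipeline; the coefficients are copied from the printout above.

```python
# Manually score one hypothetical case against the fitted linear model above.
# Only the non-zero dummy features contribute; coefficients from the printout.
coefficients = {
    'age_group_80+ Years': 0.25293360382332664,
    'hosp_yn_Yes': 0.16257323234806542,
    'icu_yn_Yes': 0.2727255294613255,
    'medcond_yn_Yes': 0.018213048132006193,
}
intercept = -0.08155627092430544

score = sum(coefficients.values()) + intercept
# the same 0.5 cut-off as threshhold_class in create_model
predicted_class = 1 if score >= 0.5 else 0
print(score, predicted_class)
```

With these features active the raw score is about 0.62, so the 0.5 threshold classifies the case as a death, matching the intuition that age, hospitalisation, and ICU admission carry the largest positive weights.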
---------------
---------------
As required in part 3, we also evaluate the model on the data it was trained on (training-set predictions).
The first ten training-set predictions are:
| | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 8396 | 0 | 0.032470 | 0 | 0 |
| 987 | 0 | 0.058129 | 0 | 0 |
| 7274 | 0 | -0.015903 | 0 | 0 |
| 1000 | 0 | 0.034344 | 0 | 0 |
| 4848 | 0 | 0.024371 | 0 | 0 |
| 9819 | 0 | 0.012987 | 0 | 0 |
| 7109 | 0 | -0.008743 | 0 | 0 |
| 3123 | 0 | -0.004605 | 0 | 0 |
| 5279 | 1 | 0.162697 | 0 | 1 |
| 7752 | 0 | 0.017781 | 0 | 0 |
----REPORT----
MAE: 0.03306898099655843
MSE: 0.03306898099655843
RMSE: 0.1818487860739203
R2: 0.02128667665960282
----DETAIL----
Accuracy:
0.9669310190034416
Confusion matrix:
[[6443 6]
[ 215 19]]
Classification report:
{'0': {'precision': 0.9677080204265546, 'recall': 0.9990696231973949, 'f1-score': 0.9831387808041504, 'support': 6449}, '1': {'precision': 0.76, 'recall': 0.0811965811965812, 'f1-score': 0.14671814671814673, 'support': 234}, 'accuracy': 0.9669310190034416, 'macro avg': {'precision': 0.8638540102132772, 'recall': 0.5401331021969881, 'f1-score': 0.5649284637611486, 'support': 6683}, 'weighted avg': {'precision': 0.9604352871062173, 'recall': 0.9669310190034416, 'f1-score': 0.9538521687472711, 'support': 6683}}
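The raw classification-report dictionary above is hard to read. Assuming it came from sklearn's `classification_report(..., output_dict=True)`, it can be rendered as a table with pandas (a convenience sketch, not part of the submitted pipeline; values abbreviated from the printout):

```python
import pandas as pd

# per-class metrics as produced by classification_report(..., output_dict=True)
report = {
    '0': {'precision': 0.9677, 'recall': 0.9991, 'f1-score': 0.9831, 'support': 6449},
    '1': {'precision': 0.76, 'recall': 0.0812, 'f1-score': 0.1467, 'support': 234},
}
# transpose so classes are rows and metrics are columns
report_df = pd.DataFrame(report).T
print(report_df)
```

The transposed frame makes the class imbalance obvious: class 1 (death) has only 234 support rows and very low recall.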
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.03315881326352531
MSE: 0.03315881326352531
RMSE: 0.1820956157174722
R2: -0.003720967967960398
----DETAIL----
Accuracy:
0.9668411867364747
Confusion matrix:
[[2760 7]
[ 88 10]]
Classification report:
{'0': {'precision': 0.9691011235955056, 'recall': 0.9974701843151428, 'f1-score': 0.9830810329474622, 'support': 2767}, '1': {'precision': 0.5882352941176471, 'recall': 0.10204081632653061, 'f1-score': 0.1739130434782609, 'support': 98}, 'accuracy': 0.9668411867364747, 'macro avg': {'precision': 0.7786682088565764, 'recall': 0.5497555003208368, 'f1-score': 0.5784970382128616, 'support': 2865}, 'weighted avg': {'precision': 0.9560732522905038, 'recall': 0.9668411867364747, 'f1-score': 0.9554026863617792, 'support': 2865}}
[0.02205627 0.02405428 0.02654312 0.0244092 0.02530132]
Avg RMSE score over 5 folds:
0.15636537460014627
Stddev RMSE score over 5 folds:
0.004765213675823956
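The five-fold RMSE scores above follow the standard cross-validation pattern; a self-contained sketch on toy data (the homework's `create_model` presumably does the equivalent internally — sklearn reports negated MSE, which must be negated back and square-rooted per fold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# toy data standing in for the dummy-encoded training features
rng = np.random.default_rng(14395076)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# sklearn returns negated MSE, so negate and take the square root per fold
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring='neg_mean_squared_error')
fold_rmse = np.sqrt(-neg_mse)
print(fold_rmse, fold_rmse.mean(), fold_rmse.std())
```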
new_logistic_model_result_dictionary=(
create_model(fulldf=new_modelling_dummy_df
,train_df=new_train_data
,test_df=new_test_data
,target_column=target_column
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True
,mod_type='Logistic')
)
death_yn = logistic(
('sex_Male' * 0.2931787764996919)
+('sex_Unknown' * 0.003980533277351162)
+('age_group_10 - 19 Years' * -1.363216229038601)
+('age_group_20 - 29 Years' * -1.761459231843649)
+('age_group_30 - 39 Years' * -1.4207781202581289)
+('age_group_40 - 49 Years' * -0.6361183521268586)
+('age_group_50 - 59 Years' * -0.026382335696195046)
+('age_group_60 - 69 Years' * 1.3180702413664858)
+('age_group_70 - 79 Years' * 2.0595491566787993)
+('age_group_80+ Years' * 3.19327201387594)
+('age_group_Unknown' * 0.8749234037068214)
+('hosp_yn_OTH' * 0.0)
+('hosp_yn_Unknown' * 0.8795050324158765)
+('hosp_yn_Yes' * 2.1647519434015075)
+('icu_yn_Unknown' * 0.26396277744521646)
+('icu_yn_Yes' * 1.987871310440442)
+('medcond_yn_Unknown' * 0.34857702232668303)
+('medcond_yn_Yes' * 0.6642011421430392)
+('race_Asian' * -0.431948193435839)
+('race_Black' * -0.024272436622114348)
+('race_Hispanic/Latino' * 0.49281513436803076)
+('race_Multiple/Other' * -0.3332154772075355)
+('race_Native Hawaiian/Other Pacific Islander' * -0.05277313434246174)
+('race_Unknown' * -0.5762991587092293)
+('race_White' * 0.20496169840743123)
+('days_until_onset_1.0' * 0.8075339074139909)
+('days_until_onset_2.0' * -0.027665290499748012)
+('days_until_onset_3.0' * -0.15043595980670862)
+('days_until_onset_4.0' * -0.06033493356858785)
+('days_until_onset_5.0' * -0.0003662330855426008)
+('days_until_onset_6.0' * -0.2061901660522154)
+('days_until_onset_7.0' * -0.0013062518170191466)
+('days_until_onset_10.0' * -0.0008978613931723872)
+('onset_present_True' * -0.3762871330604844)
+('cdc_case_earliest_weekday_1' * -0.31172194691886584)
+('cdc_case_earliest_weekday_2' * 0.07837934677231513)
+('cdc_case_earliest_weekday_3' * -0.2272190163826874)
+('cdc_case_earliest_weekday_4' * 0.007921990953652952)
+('cdc_case_earliest_weekday_5' * 0.08331006032833918)
+('cdc_case_earliest_weekday_6' * 0.1422771819734833)
+('cdc_case_earliest_month_2' * 0.8328480863048937)
+('cdc_case_earliest_month_3' * 0.647903610908308)
+('cdc_case_earliest_month_4' * 0.8907632901766741)
+('cdc_case_earliest_month_5' * 0.427604613866425)
+('cdc_case_earliest_month_6' * 0.15831080963796873)
+('cdc_case_earliest_month_7' * -0.1975586528965008)
+('cdc_case_earliest_month_8' * -0.09409320586353237)
+('cdc_case_earliest_month_9' * -0.3416130039551457)
+('cdc_case_earliest_month_10' * -0.4506354159781141)
+('cdc_case_earliest_month_11' * -0.3713292113467674)
+('cdc_case_earliest_month_12' * -0.364335730496805)
+('cdc_case_earliest_year_2021' * -0.9060107230958336)
+('demographic_missing_True' * -0.0817118299795506)
+('medical_missing_True' * 0.15089235625491876) + ([-5.66789785]))
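The logistic model turns the weighted sum into a probability via the sigmoid. Scoring the same hypothetical high-risk case by hand (80+, hospitalised, ICU, medical condition, all other dummies 0; coefficients copied from the printout above):

```python
import math

# non-zero dummy features for the hypothetical case; weights from the printout
coefficients = {
    'age_group_80+ Years': 3.19327201387594,
    'hosp_yn_Yes': 2.1647519434015075,
    'icu_yn_Yes': 1.987871310440442,
    'medcond_yn_Yes': 0.6642011421430392,
}
intercept = -5.66789785

log_odds = sum(coefficients.values()) + intercept
probability = 1 / (1 + math.exp(-log_odds))  # sigmoid squashes log-odds into (0, 1)
print(log_odds, probability)
```

The log-odds come out around 2.34, i.e. roughly a 91% predicted probability of death for this case, despite the strongly negative intercept.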
---------------
---------------
As required in part 3, we also evaluate the model on the data it was trained on (training-set predictions).
The first ten training-set predictions are:
| | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 0 | 0 | 1 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.0321711806075116
MSE: 0.0321711806075116
RMSE: 0.17936326437571212
R2: 0.047858079103233475
----DETAIL----
Accuracy:
0.9678288193924884
Confusion matrix:
[[6405 44]
[ 171 63]]
Classification report:
{'0': {'precision': 0.9739963503649635, 'recall': 0.9931772367808963, 'f1-score': 0.9834932821497121, 'support': 6449}, '1': {'precision': 0.5887850467289719, 'recall': 0.2692307692307692, 'f1-score': 0.36950146627565983, 'support': 234}, 'accuracy': 0.9678288193924884, 'macro avg': {'precision': 0.7813906985469676, 'recall': 0.6312040030058328, 'f1-score': 0.676497374212686, 'support': 6683}, 'weighted avg': {'precision': 0.9605084788924478, 'recall': 0.9678288193924884, 'f1-score': 0.9619948405943434, 'support': 6683}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Log: The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.030366492146596858
MSE: 0.030366492146596858
RMSE: 0.17425984088882
R2: 0.08080290301881532
----DETAIL----
Accuracy:
0.9696335078534032
Confusion matrix:
[[2749 18]
[ 69 29]]
Classification report:
{'0': {'precision': 0.975514549325763, 'recall': 0.9934947596675099, 'f1-score': 0.9844225604297224, 'support': 2767}, '1': {'precision': 0.6170212765957447, 'recall': 0.29591836734693877, 'f1-score': 0.4, 'support': 98}, 'accuracy': 0.9696335078534032, 'macro avg': {'precision': 0.7962679129607538, 'recall': 0.6447065635072243, 'f1-score': 0.6922112802148612, 'support': 2865}, 'weighted avg': {'precision': 0.9632519522131829, 'recall': 0.9696335078534032, 'f1-score': 0.9644318410851803, 'support': 2865}}
[0.03403141 0.02722513 0.03193717 0.03404924 0.03719225]
Avg RMSE score over 5 folds:
0.18111269431036137
Stddev RMSE score over 5 folds:
0.009232169469685
new_rf_model_result_dictionary=(
create_model(fulldf=new_modelling_dummy_df
,train_df=new_train_data
,test_df=new_test_data
,target_column=target_column
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True
,mod_type='Random Forest')
)
---------------
---------------
As required in part 3, we also evaluate the model on the data it was trained on (training-set predictions).
The first ten training-set predictions are:
| | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 1 | 1 | 0 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.0023941343707915607
MSE: 0.0023941343707915607
RMSE: 0.04892989240527268
R2: 0.9291429268169848
----DETAIL----
Accuracy:
0.9976058656292084
Confusion matrix:
[[6445 4]
[ 12 222]]
Classification report:
{'0': {'precision': 0.9981415518042435, 'recall': 0.9993797487982633, 'f1-score': 0.9987602665426933, 'support': 6449}, '1': {'precision': 0.9823008849557522, 'recall': 0.9487179487179487, 'f1-score': 0.9652173913043478, 'support': 234}, 'accuracy': 0.9976058656292084, 'macro avg': {'precision': 0.9902212183799979, 'recall': 0.974048848758106, 'f1-score': 0.9819888289235206, 'support': 6683}, 'weighted avg': {'precision': 0.9975869032867294, 'recall': 0.9976058656292084, 'f1-score': 0.9975857890915827, 'support': 6683}}
---------------
---------------
---------------
---------------
PROPER RESULT FROM TEST DATA:
Random Forest: The Original Vs Predicted Result Is:
Random Forest: Feature Importance:
----REPORT----
MAE: 0.030715532286212915
MSE: 0.030715532286212915
RMSE: 0.17525847279436424
R2: 0.07023741914546833
----DETAIL----
Accuracy:
0.9692844677137871
Confusion matrix:
[[2751 16]
[ 72 26]]
Classification report:
{'0': {'precision': 0.9744952178533475, 'recall': 0.9942175641488977, 'f1-score': 0.984257602862254, 'support': 2767}, '1': {'precision': 0.6190476190476191, 'recall': 0.2653061224489796, 'f1-score': 0.37142857142857144, 'support': 98}, 'accuracy': 0.9692844677137871, 'macro avg': {'precision': 0.7967714184504833, 'recall': 0.6297618432989387, 'f1-score': 0.6778430871454127, 'support': 2865}, 'weighted avg': {'precision': 0.9623368008610398, 'recall': 0.9692844677137871, 'f1-score': 0.9632952136544003, 'support': 2865}}
[0.96335079 0.97015707 0.96125654 0.96542693 0.96385542]
Avg Accuracy score over 5 folds:
0.9822459154149904
Stddev Accuracy score over 5 folds:
0.0015200616444550976
new_xgboost_model_result_dictionary=(
create_model(fulldf=new_modelling_dummy_df
,train_df=new_train_data
,test_df=new_test_data
,target_column=target_column
,plot_comp=True
, threshhold_class=0.5
, assess=True
, verbose=True
,mod_type='XGBoost')
)
Model Training Score: 98.02483914409696%
---------------
---------------
As required in part 3, we also evaluate the model on the data it was trained on (training-set predictions).
The first ten training-set predictions are:
| | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 8396 | 0 | 0 | 0 | 0 |
| 987 | 0 | 0 | 0 | 0 |
| 7274 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 0 | 0 | 0 |
| 4848 | 0 | 0 | 0 | 0 |
| 9819 | 0 | 0 | 0 | 0 |
| 7109 | 0 | 0 | 0 | 0 |
| 3123 | 0 | 0 | 0 | 0 |
| 5279 | 1 | 0 | 0 | 1 |
| 7752 | 0 | 0 | 0 | 0 |
----REPORT----
MAE: 0.019751608559030374
MSE: 0.019751608559030374
RMSE: 0.14054041610522708
R2: 0.4154291462401247
----DETAIL----
Accuracy:
0.9802483914409696
Confusion matrix:
[[6429 20]
[ 112 122]]
Classification report:
{'0': {'precision': 0.9828772358966519, 'recall': 0.9968987439913165, 'f1-score': 0.9898383371824481, 'support': 6449}, '1': {'precision': 0.8591549295774648, 'recall': 0.5213675213675214, 'f1-score': 0.648936170212766, 'support': 234}, 'accuracy': 0.9802483914409696, 'macro avg': {'precision': 0.9210160827370584, 'recall': 0.759133132679419, 'f1-score': 0.819387253697607, 'support': 6683}, 'weighted avg': {'precision': 0.9785451964415134, 'recall': 0.9802483914409696, 'f1-score': 0.9779019153552888, 'support': 6683}}
---------------
---------------
Model Accuracy: [97.22513089 96.54450262 96.33507853 96.33315872 96.07124149]
The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.030715532286212915
MSE: 0.030715532286212915
RMSE: 0.17525847279436424
R2: 0.07023741914546833
----DETAIL----
Accuracy:
0.9692844677137871
Confusion matrix:
[[2744 23]
[ 65 33]]
Classification report:
{'0': {'precision': 0.9768600925596298, 'recall': 0.9916877484640405, 'f1-score': 0.9842180774748924, 'support': 2767}, '1': {'precision': 0.5892857142857143, 'recall': 0.336734693877551, 'f1-score': 0.4285714285714286, 'support': 98}, 'accuracy': 0.9692844677137871, 'macro avg': {'precision': 0.783072903422672, 'recall': 0.6642112211707958, 'f1-score': 0.7063947530231605, 'support': 2865}, 'weighted avg': {'precision': 0.963602749079405, 'recall': 0.9692844677137871, 'f1-score': 0.9652116650516676, 'support': 2865}}
Importance by Booster Plot
Importance by Weight:
The Tree Is:
As part of testing extensions of our model, we have tried:
In each case, the first two candidate improvements resulted in an overall inferior model; the only exception was the extension to include all features for XGBoost, which gave similar performance to the original.
Based on these results, and particularly on the XGBoost model only attaining performance on par with the original model, I suspect the key to improving results further will be gathering additional data or, as demonstrated with XGBoost, developing a new model.
Ideally, we would also gather more granular patient data to engineer more 'hard-hitting' features, such as history of pulmonary illness or cardiovascular risk indicators, which are significant for COVID patients. Given the XGBoost model's performance, I believe there is only negligible room for improvement with these features outside of more sophisticated methods beyond the scope of this module; the best chance for improvement will come from additional data being used to train the model.
- (4.3) Take your best model trained and selected based on past data (ie your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings.
First, I read in the new file and find that it has the same data quality issues as the original. I therefore wrap my Assignment 1 cleansing steps in a function. Because I assume every row needs a prediction, I do not drop duplicate rows (unlike my Assignment 1 submission); all other steps are identical. I then train a new Random Forest model on the full historic data file and use it to predict each row of the newly cleansed file.
Note: I do not simply call predict with the model I already trained, as it was fitted on the training split only (i.e. 70% of the original file). Copying the existing function and setting the full original data as the training set and the new file as the test set was far more convenient, as it also includes the analysis. I create both an XGBoost model and a Random Forest model (the best of the models we were required to build).
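The train-on-everything, predict-on-new-file approach described above reduces to the following pattern (a minimal sketch with hypothetical frame names; the full functions defined later also plot and report metrics):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_full_and_predict(historic_df, new_df, target='death_yn'):
    """Train on the full historic data, then predict every row of the new file."""
    model = RandomForestClassifier(n_estimators=100, random_state=14395076)
    model.fit(historic_df.drop(columns=[target]), historic_df[target])
    return model.predict(new_df.drop(columns=[target]))

# toy demonstration with two dummy-encoded features
historic = pd.DataFrame({'hosp_yn_Yes': [0, 1, 0, 1],
                         'icu_yn_Yes': [0, 1, 0, 1],
                         'death_yn':   [0, 1, 0, 1]})
new = pd.DataFrame({'hosp_yn_Yes': [1, 0],
                    'icu_yn_Yes': [1, 0],
                    'death_yn':   [1, 0]})
preds = fit_full_and_predict(historic, new)
print(preds)
```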
Read in the new file.
new_data_df=pd.read_csv("24032021-covid19-cdc-deathyn-recent-10k.csv")
Define a function that cleanses it identically to my Assignment 1 data
def assignment_1_cleansing(file):
"""Replicates the assignment 1 cleansing process on a new file.
It does not drop duplicates as I'm assuming we need to test on the 10k samples.
If you want to understand how and why this works please refer to assignment 1."""
raw_covid_sample_data_df=ingest_orig_covid_data(file,data_dictionary_per_cdc)
staging_covid_sample_data_df=raw_covid_sample_data_df.copy(False)
datetime_columns=[column_headers for column_headers, column_desc_array in data_dictionary_per_cdc.items() if column_desc_array[1] == 'datetime']
categorical_columns=[column_headers for column_headers, column_desc_array in data_dictionary_per_cdc.items() if column_desc_array[1] == 'category']
num_columns=[column_headers for column_headers, column_desc_array in data_dictionary_per_cdc.items() if column_desc_array[1] == 'numeric']
datetime_format='%Y/%m/%d'
data_convert(staging_covid_sample_data_df,'datetime',datetime_columns,datetime_format)
data_convert(staging_covid_sample_data_df,'category',categorical_columns,datetime_format)
data_convert(staging_covid_sample_data_df,'numeric',num_columns,datetime_format)
#Don't Drop Duplicates
staging_covid_sample_data_df=staging_covid_sample_data_df#.drop_duplicates()
display(staging_covid_sample_data_df)
new_shape=staging_covid_sample_data_df.shape
new_row_count=new_shape[0]
print()
print("There are {} duplicates.".format(staging_covid_sample_data_df.duplicated().sum()))
staging_covid_sample_data_df=staging_covid_sample_data_df[['cdc_case_earliest_dt', 'cdc_report_dt', 'pos_spec_dt', 'onset_dt',
'current_status', 'sex', 'age_group', 'race_ethnicity_combined',
'hosp_yn', 'icu_yn', 'death_yn', 'medcond_yn']]
staging_covid_sample_data_df.to_csv("02_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv", index_label=False)
try:
staging_covid_sample_data_df[['cdc_case_earliest_dt', 'cdc_report_dt', 'pos_spec_dt', 'onset_dt',
'current_status', 'sex', 'age_group', 'race_ethnicity_combined',
'hosp_yn', 'icu_yn', 'death_yn', 'medcond_yn']].to_pickle("02_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv"[:-3]+'pickle')
except:
print('Could not write the pickle file.')
#Read Pickle
try:
deduped_covid_sample_df=pd.read_pickle("02_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv"[:-3]+'pickle')
#Fall back to the CSV if the pickle could not be read
except:
deduped_covid_sample_df=ingest_orig_covid_data("02_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv",data_dictionary_per_cdc)
data_convert(deduped_covid_sample_df,'datetime',datetime_columns,datetime_format)
data_convert(deduped_covid_sample_df,'category',categorical_columns,datetime_format)
data_convert(deduped_covid_sample_df,'numeric',num_columns,datetime_format)
for cat_column in categorical_columns:
#This is an appalling workaround
deduped_covid_sample_df[cat_column]=deduped_covid_sample_df[cat_column].astype(str)
deduped_covid_sample_df.loc[(deduped_covid_sample_df[cat_column].isin(['Missing','Unknown'])),cat_column]='Unknown'
deduped_covid_sample_df[cat_column]=deduped_covid_sample_df[cat_column].astype('category')
deduped_covid_sample_df.loc[((deduped_covid_sample_df['current_status'] == 'Probable Case') & (deduped_covid_sample_df["pos_spec_dt"].notna())),'current_status']='Laboratory-confirmed case'
non_pos_spec=''
non_pos_spec=[x for x in deduped_covid_sample_df.columns if x!='pos_spec_dt']
deduped_covid_sample_df=deduped_covid_sample_df[non_pos_spec]
deduped_covid_sample_df=deduped_covid_sample_df.drop(deduped_covid_sample_df[deduped_covid_sample_df['hosp_yn']=='OTH'].index)
deduped_covid_sample_df.loc[(deduped_covid_sample_df['hosp_yn']!='Yes')&(deduped_covid_sample_df['icu_yn']=='Yes'),'hosp_yn']='Yes'
non_rep_dt=[x for x in deduped_covid_sample_df.columns if x!='cdc_report_dt']
deduped_covid_sample_df=deduped_covid_sample_df[non_rep_dt]
race_df=deduped_covid_sample_df['race_ethnicity_combined']
race_df = deduped_covid_sample_df['race_ethnicity_combined'].str.split(',',expand=True)
race_df.columns=['race','ethnicity']
race_df.loc[(race_df['race']=='Hispanic/Latino'),'ethnicity']='Hispanic/Latino'
race_df.loc[(race_df['race']=='Unknown'),'ethnicity']='Unknown'
for col in race_df:
race_df[col]=race_df[col].astype('category')
deduped_covid_sample_df=deduped_covid_sample_df.merge(race_df,left_index=True, right_index=True,suffixes=('_orig', '_race'))
non_rep_dt=[x for x in deduped_covid_sample_df.columns if x!='race_ethnicity_combined']
deduped_covid_sample_df=deduped_covid_sample_df[non_rep_dt]
non_ethn_df=[x for x in deduped_covid_sample_df.columns if x!='ethnicity']
deduped_covid_sample_df=deduped_covid_sample_df[non_ethn_df]
deduped_covid_sample_df['days_until_onset']=(deduped_covid_sample_df['onset_dt'] - deduped_covid_sample_df['cdc_case_earliest_dt']).dt.days
deduped_covid_sample_df.loc[(deduped_covid_sample_df['days_until_onset'].isna()),'onset_present']='False'
deduped_covid_sample_df.loc[~(deduped_covid_sample_df['days_until_onset'].isna()),'onset_present']='True'
del deduped_covid_sample_df['onset_dt']
categorical_columns=deduped_covid_sample_df.select_dtypes('category').columns
datetime_columns=deduped_covid_sample_df.select_dtypes('datetime').columns
numeric_columns=deduped_covid_sample_df.select_dtypes('float').columns
deduped_covid_sample_df.to_csv("03_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv", index_label=False)
try:
deduped_covid_sample_df.to_pickle("03_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv"[:-3]+'pickle')
except:
print('Could not write the pickle file.')
categorical_columns=['current_status', 'sex', 'age_group', 'hosp_yn', 'icu_yn', 'death_yn',
'medcond_yn', 'race']
datetime_columns=['cdc_case_earliest_dt']
numeric_columns=['days_until_onset']
#Read Pickle
try:
adf=pd.read_pickle("03_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv"[:-3]+'pickle')
#Fall back to the CSV if the pickle could not be read
except:
adf=pd.read_csv("03_cleaned_24032021-covid19-cdc-deathyn-recent-10k.csv",dtype=str)
data_convert(adf,'datetime',datetime_columns,datetime_format)
data_convert(adf,'category',categorical_columns,datetime_format)
data_convert(adf,'numeric',numeric_columns,datetime_format)
adf['cdc_case_earliest_day']=(adf['cdc_case_earliest_dt'].dt.day).astype('category')
adf['cdc_case_earliest_weekday']=(adf['cdc_case_earliest_dt'].dt.weekday).astype('category')
adf['cdc_case_earliest_month']=(adf['cdc_case_earliest_dt'].dt.month).astype('category')
adf['cdc_case_earliest_year']=(adf['cdc_case_earliest_dt'].dt.year).astype('category')
#Some demographic info is missing
adf.loc[(adf['sex']=='Unknown')|(adf['age_group']=='Unknown')|(adf['race']=='Unknown'),'demographic_missing']='True'
adf.loc[(adf['demographic_missing']!='True'),'demographic_missing']='False'
#Some demographic info is missing - Death is never missing
adf.loc[(adf['hosp_yn']=='Unknown')|(adf['icu_yn']=='Unknown')|(adf['medcond_yn']=='Unknown'),'medical_missing']='True'
adf.loc[(adf['medical_missing']!='True'),'medical_missing']='False'
adf['medical_missing']=adf['medical_missing'].astype('category')
adf['demographic_missing']=adf['demographic_missing'].astype('category')
adf.to_csv("04_extended_24032021-covid19-cdc-deathyn-recent-10k.csv", index_label=False)
try:
adf.to_pickle("04_extended_24032021-covid19-cdc-deathyn-recent-10k.csv"[:-3]+'pickle')
except:
print('Could not write the pickle file.')
return adf
Get my new cleansed ADF data
adf_new_data_df=assignment_1_cleansing(file="24032021-covid19-cdc-deathyn-recent-10k.csv")
Inside ingest_orig_covid_data(24032021-covid19-cdc-deathyn-recent-10k.csv,dictionary)
| | cdc_case_earliest_dt | cdc_report_dt | pos_spec_dt | onset_dt | current_status | sex | age_group | race_ethnicity_combined | hosp_yn | icu_yn | death_yn | medcond_yn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021/01/22 | 2021/01/22 | NaN | NaN | Probable Case | Female | 0 - 9 Years | Asian, Non-Hispanic | No | Missing | No | Missing |
| 1 | 2021/01/26 | NaN | NaN | NaN | Laboratory-confirmed case | Female | 30 - 39 Years | Unknown | Unknown | Missing | No | Missing |
| 2 | 2021/02/03 | 2021/02/05 | NaN | 2021/02/03 | Laboratory-confirmed case | Female | 40 - 49 Years | Asian, Non-Hispanic | Missing | Missing | No | Missing |
| 3 | 2021/02/05 | 2021/02/05 | 2021/02/07 | 2021/02/05 | Laboratory-confirmed case | Male | 40 - 49 Years | Hispanic/Latino | No | Unknown | No | No |
| 4 | 2021/01/27 | 2021/01/27 | NaN | NaN | Laboratory-confirmed case | Female | 40 - 49 Years | White, Non-Hispanic | No | Missing | No | Missing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 2021/01/20 | NaN | NaN | NaN | Laboratory-confirmed case | Male | 30 - 39 Years | Unknown | Missing | Missing | No | Missing |
| 9996 | 2021/02/01 | 2021/02/01 | NaN | 2021/02/01 | Laboratory-confirmed case | Female | 50 - 59 Years | Unknown | No | Missing | No | Missing |
| 9997 | 2021/01/29 | 2021/02/03 | NaN | NaN | Laboratory-confirmed case | Female | 10 - 19 Years | White, Non-Hispanic | Missing | Missing | No | Missing |
| 9998 | 2021/01/28 | 2021/01/28 | NaN | NaN | Laboratory-confirmed case | Male | 20 - 29 Years | Unknown | Missing | Missing | No | Missing |
| 9999 | 2021/01/24 | 2021/02/03 | NaN | 2021/01/24 | Laboratory-confirmed case | Male | 40 - 49 Years | White, Non-Hispanic | No | Missing | No | Missing |
[10000 rows x 12 columns]
Your file contains:
10000 rows x 12 columns.
The following columns are present:
"cdc_case_earliest_dt"
"cdc_report_dt"
"pos_spec_dt"
"onset_dt"
"current_status"
"sex"
"age_group"
"race_ethnicity_combined"
"hosp_yn"
"icu_yn"
"death_yn"
"medcond_yn"
The columns in this data sample match the CDC's schema
Inside data_convert()
Converting to datetime
Inside data_convert()
Converting to category
Inside data_convert()
No need to convert
| cdc_case_earliest_dt | cdc_report_dt | pos_spec_dt | onset_dt | current_status | sex | age_group | race_ethnicity_combined | hosp_yn | icu_yn | death_yn | medcond_yn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-22 | 2021-01-22 | NaT | NaT | Probable Case | Female | 0 - 9 Years | Asian, Non-Hispanic | No | Missing | No | Missing |
| 1 | 2021-01-26 | NaT | NaT | NaT | Laboratory-confirmed case | Female | 30 - 39 Years | Unknown | Unknown | Missing | No | Missing |
| 2 | 2021-02-03 | 2021-02-05 | NaT | 2021-02-03 | Laboratory-confirmed case | Female | 40 - 49 Years | Asian, Non-Hispanic | Missing | Missing | No | Missing |
| 3 | 2021-02-05 | 2021-02-05 | 2021-02-07 | 2021-02-05 | Laboratory-confirmed case | Male | 40 - 49 Years | Hispanic/Latino | No | Unknown | No | No |
| 4 | 2021-01-27 | 2021-01-27 | NaT | NaT | Laboratory-confirmed case | Female | 40 - 49 Years | White, Non-Hispanic | No | Missing | No | Missing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 2021-01-20 | NaT | NaT | NaT | Laboratory-confirmed case | Male | 30 - 39 Years | Unknown | Missing | Missing | No | Missing |
| 9996 | 2021-02-01 | 2021-02-01 | NaT | 2021-02-01 | Laboratory-confirmed case | Female | 50 - 59 Years | Unknown | No | Missing | No | Missing |
| 9997 | 2021-01-29 | 2021-02-03 | NaT | NaT | Laboratory-confirmed case | Female | 10 - 19 Years | White, Non-Hispanic | Missing | Missing | No | Missing |
| 9998 | 2021-01-28 | 2021-01-28 | NaT | NaT | Laboratory-confirmed case | Male | 20 - 29 Years | Unknown | Missing | Missing | No | Missing |
| 9999 | 2021-01-24 | 2021-02-03 | NaT | 2021-01-24 | Laboratory-confirmed case | Male | 40 - 49 Years | White, Non-Hispanic | No | Missing | No | Missing |
10000 rows × 12 columns
There are 1681 duplicates.
Next, I define Random Forest and XGBoost functions that train over the full historical dataset and write their outputs to new file names.
def create_full_RandomForest_model(train_df,test_df,target_column,plot_comp, threshhold_class=0.5, assess=True, verbose=True):
"""Create a random forest model - To be used on the full data set.
Note: The variable names say linear but that's only because it was first built for a linear model. Observe it does initialise the correct model type."""
#Test
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
#Train
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Create the DT Regression
lin_regression_model = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=14395076)
#Fit the data
lin_regression_model.fit(X_train,y_train)
#Check the predictions
lin_prediction = lin_regression_model.predict(X_test)
#Classify them - Note: Not needed for logistic but keeping as no impact
linear_prediction_classified=np.where(lin_prediction>=threshhold_class,1,0)
#Plot it
if plot_comp:
#Original Versus Prediction
print("Full Random Forest: The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
#Each test is a point
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
#Add second plot for prediction class
plt.plot(x_axis, linear_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('./full_data_randomforest_pred_vs_orig_covid.png')
plt.show()
#Save the Model
filename = './full_data_randomforest_model_covid.pickle'
pickle.dump(lin_regression_model, open(filename, 'wb'))
feature_imp=pd.DataFrame({'feature': X_train.columns, 'importance': lin_regression_model.feature_importances_})
feature_imp=feature_imp.set_index('feature')
feature_imp=feature_imp.sort_values('importance', axis=0, ascending=False)
if plot_comp:
print("Random Forest: Feature Importance:")
plt.figure(figsize=(50,20))
#Plot from DF
feature_imp.plot(kind='barh')
#Add second plot for prediction class
plt.title("Random Forest Feature Importance")
plt.legend()
plt.savefig('./full_data_randomforest_importance_covid.png')
plt.show()
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':lin_prediction
,'PredictionClass':linear_prediction_classified})
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=linear_prediction_classified, verbose=True)
result_dict={}
result_dict['Model']=lin_regression_model
result_dict['Model_Coefficients']=None#zip(X_train.columns,lin_regression_model.coef_)
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['FeatureImportance']=feature_imp
return result_dict
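The model is persisted with `pickle.dump` above; reloading it later follows the mirror pattern. A toy round trip (using the same file path as the function, with a small stand-in forest so the sketch is self-contained):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# toy model standing in for the fitted forest, saved and reloaded the same way
model = RandomForestClassifier(n_estimators=5, random_state=0)
model.fit([[0], [1], [0], [1]], [0, 1, 0, 1])

filename = './full_data_randomforest_model_covid.pickle'
with open(filename, 'wb') as f:
    pickle.dump(model, f)          # same call as in the function above
with open(filename, 'rb') as f:
    reloaded = pickle.load(f)      # the mirror call for reuse later

print(reloaded.predict([[0], [1]]))
```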
def create_Full_xgboost_model(train_df,test_df,target_column,plot_comp=True,plot_tree=True, threshhold_class=0.5):
"""Create an xgboostmodel"""
X_test=test_df.drop([target_column], axis=1)
y_test= test_df[target_column]
X_train=train_df.drop([target_column], axis=1)
y_train=train_df[target_column]
#Parameter dictionary for hyperparameter tuning
model_parameters = {'nthread':[2],
'objective':['binary:logistic'],
'learning_rate': [.03, 0.05, .07],
'max_depth': [5, 6, 7, 8],
'min_child_weight': [4],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [500]}
#Create the XGBoost Classifier - Set loss function as logistic
xg_regression_model = xg.XGBClassifier(objective ='binary:logistic',verbosity = 0)
#Hypertune
grid = GridSearchCV(xg_regression_model, model_parameters)
grid.fit(X_train, y_train)
best_parameters=grid.best_params_
xg_regression_model = grid.best_estimator_
#Score the model
score=xg_regression_model.score(X_train,y_train)
print("Model Training Score: {}%".format(score*100))
#Check the predictions
model_prediction = xg_regression_model.predict(X_test)
#Classify them
model_prediction_classified=np.where(model_prediction>=threshhold_class,1,0)
if plot_comp:
#Original Versus Prediction
print("The Original Vs Predicted Result Is:")
plt.figure(figsize=(50,20))
x_axis = range(len(y_test))
plt.plot(x_axis, y_test, label="Original")
plt.plot(x_axis, model_prediction_classified, label="Predicted")
plt.title("COVID test and predicted data")
plt.legend()
plt.savefig('full_xg_pred_vs_orig_covid.png')
plt.show()
filename = './full_xg_model_covid.pickle'
xg_regression_model.save_model(filename)
#Results
pred_vs_act_df=pd.DataFrame({'Actual':y_test
,'Predicted':model_prediction
,'PredictionClass':model_prediction_classified})
pred_vs_act_df['Predicted']=pred_vs_act_df['Predicted']
pred_vs_act_df['Diff']=pred_vs_act_df['Actual']-pred_vs_act_df['PredictionClass']
#Metrics
model_metric=model_metrics(testActualVal=y_test, predictions=model_prediction_classified, verbose=True)
result_dict={}
result_dict['Model']=xg_regression_model
result_dict['Actual vs Prediction']=pred_vs_act_df
result_dict['Model_Coefficients']=None
result_dict['RMSE']=model_metric['RMSE']
result_dict['MSE']=model_metric['MSE']
result_dict['MAE']=model_metric['MAE']
result_dict['Accuracy']=model_metric['Accuracy']
result_dict['Confusion']=model_metric['Confusion']
result_dict['ClassificationRep']=model_metric['ClassificationRep']
result_dict['CrossVal_RMSE_MEAN']=None
result_dict['CrossVal_RMSE_STD']=None
print("Importance by Booster Plot")
xg.plot_importance(xg_regression_model.get_booster())
print("Importance by Weight:")
xg_regression_model.get_booster().get_score(importance_type='weight')
if plot_tree:
#Visualisations, sometimes not great
try:
#Tree Plot
print("The Tree Is:")
fig, ax = plt.subplots(figsize=(100, 100))
xg.plot_tree(xg_regression_model,num_trees=2,ax=ax)
plt.savefig('xg_tree_covid.png')
plt.show()
except Exception as e:
print(e)
return result_dict
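One caveat with the function above: `XGBClassifier.predict` already returns hard 0/1 labels, so comparing them against `threshhold_class` is effectively a no-op at the default of 0.5. To make a custom threshold meaningful, the positive-class probability from `predict_proba` would have to be thresholded instead. A minimal sketch of the pattern, using scikit-learn's `GradientBoostingClassifier` as a stand-in so the snippet is self-contained (the same approach applies to `xg.XGBClassifier`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy imbalanced data standing in for the COVID dummy features
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# predict() already returns hard labels, so thresholding them at 0.5 changes nothing
hard_labels = model.predict(X)

# To use a custom threshold, threshold the positive-class probability instead
threshhold_class = 0.3  # lower threshold -> more cases flagged as deaths
proba = model.predict_proba(X)[:, 1]
custom_labels = np.where(proba >= threshhold_class, 1, 0)

# A lower threshold can only add positive predictions, never remove them
print(custom_labels.sum() >= hard_labels.sum())
```

Lowering the threshold this way is one cheap lever for trading precision against recall on the death class without retraining.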
Prepare the old and new data using the best predictive features and model parameters.
best_predictive_features = ["age_group",
                            "hosp_yn",
                            "icu_yn",
                            "medcond_yn"]
best_keep_features = [target_column] + best_predictive_features
# Use .copy() so the slice is an independent DataFrame (avoids SettingWithCopyWarning)
final_modelling_df = raw_df[best_keep_features].copy()
# Transform the target feature to 0/1
final_modelling_df['death_yn'] = final_modelling_df['death_yn'].astype(str)
final_modelling_df.loc[final_modelling_df['death_yn'] == 'Yes', 'death_yn'] = 1
final_modelling_df.loc[final_modelling_df['death_yn'] == 'No', 'death_yn'] = 0
final_modelling_df['death_yn'] = final_modelling_df['death_yn'].astype(int)
final_modelling_dummy_df = pd.get_dummies(final_modelling_df, columns=best_predictive_features, drop_first=True)
final_test_modelling_df = adf_new_data_df[best_keep_features].copy()
final_test_modelling_df['death_yn'] = final_test_modelling_df['death_yn'].astype(str)
final_test_modelling_df.loc[final_test_modelling_df['death_yn'] == 'Yes', 'death_yn'] = 1
final_test_modelling_df.loc[final_test_modelling_df['death_yn'] == 'No', 'death_yn'] = 0
final_test_modelling_df['death_yn'] = final_test_modelling_df['death_yn'].astype(int)
final_test_modelling_dummy_df = pd.get_dummies(final_test_modelling_df, columns=best_predictive_features, drop_first=True)
# The new data has no 'OTH' hospitalisation level, so add the missing dummy column
final_test_modelling_dummy_df['hosp_yn_OTH'] = 0
# Reorder columns to match the training dummy DataFrame
final_test_modelling_dummy_df = final_test_modelling_dummy_df[['death_yn', 'age_group_10 - 19 Years', 'age_group_20 - 29 Years',
    'age_group_30 - 39 Years', 'age_group_40 - 49 Years',
    'age_group_50 - 59 Years', 'age_group_60 - 69 Years',
    'age_group_70 - 79 Years', 'age_group_80+ Years', 'age_group_Unknown',
    'hosp_yn_OTH', 'hosp_yn_Unknown', 'hosp_yn_Yes', 'icu_yn_Unknown',
    'icu_yn_Yes', 'medcond_yn_Unknown', 'medcond_yn_Yes']]
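Manually appending `hosp_yn_OTH` and hard-coding the column order works, but pandas can align the new data's dummy columns to the training layout in one step with `reindex`, filling any level missing from the new data with zeros. A small sketch on toy frames (the column names here are illustrative):

```python
import pandas as pd

# Training dummies contain a level ('OTH') that the new data lacks
train_dummies = pd.DataFrame({'death_yn': [0, 1],
                              'hosp_yn_OTH': [1, 0],
                              'hosp_yn_Yes': [0, 1]})
new_df = pd.DataFrame({'death_yn': [0, 0], 'hosp_yn': ['Yes', 'No']})
new_dummies = pd.get_dummies(new_df, columns=['hosp_yn'], drop_first=True)

# Align to the training column set and order in one call;
# levels absent from the new data become all-zero columns
new_dummies = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(new_dummies.columns))  # identical to train_dummies.columns
```

This avoids silent column-order mismatches if the feature set ever changes.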
Random Forest
rf_new=create_full_RandomForest_model(final_modelling_dummy_df,final_test_modelling_dummy_df,target_column,plot_comp=True, threshhold_class=0.5, assess=True, verbose=True)
Full Random Forest: The Original Vs Predicted Result Is:
Random Forest: Feature Importance:
----REPORT----
MAE: 0.0139
MSE: 0.0139
RMSE: 0.11789826122551596
R2: -0.1628305806038184
----DETAIL----
Accuracy:
0.9861
Confusion matrix:
[[9826 53]
[ 86 35]]
Classification report:
{'0': {'precision': 0.9913236481033091, 'recall': 0.994635084522725, 'f1-score': 0.992976605527765, 'support': 9879}, '1': {'precision': 0.3977272727272727, 'recall': 0.2892561983471074, 'f1-score': 0.3349282296650718, 'support': 121}, 'accuracy': 0.9861, 'macro avg': {'precision': 0.6945254604152908, 'recall': 0.6419456414349162, 'f1-score': 0.6639524175964184, 'support': 10000}, 'weighted avg': {'precision': 0.9841411319612591, 'recall': 0.9861, 'f1-score': 0.9850142201798264, 'support': 10000}}
new_rf_class_rep=pd.DataFrame(rf_new['ClassificationRep'])
display(new_rf_class_rep)
print("The first 10 results predicted on the test set:")
display(rf_new['Actual vs Prediction'].head(10))
print("We got {:.2f}% correct".format(100*len(
rf_new['Actual vs Prediction'][rf_new['Actual vs Prediction']['Diff']==0])/len(
rf_new['Actual vs Prediction'])))
print("We correctly predicted {} out of {} deaths.".format(
len(rf_new['Actual vs Prediction'][(rf_new['Actual vs Prediction']['Diff']==0)&(rf_new['Actual vs Prediction']['Actual']==1)]),len(rf_new['Actual vs Prediction'][(rf_new['Actual vs Prediction']['Actual']==1)])))
print("We correctly predicted {} out of {} lives.".format(
len(rf_new['Actual vs Prediction'][(rf_new['Actual vs Prediction']['Diff']==0)&(rf_new['Actual vs Prediction']['Actual']==0)]),len(rf_new['Actual vs Prediction'][(rf_new['Actual vs Prediction']['Actual']==0)])))
|   | 0 | 1 | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.991324 | 0.397727 | 0.9861 | 0.694525 | 0.984141 |
| recall | 0.994635 | 0.289256 | 0.9861 | 0.641946 | 0.986100 |
| f1-score | 0.992977 | 0.334928 | 0.9861 | 0.663952 | 0.985014 |
| support | 9879.000000 | 121.000000 | 0.9861 | 10000.000000 | 10000.000000 |
The first 10 results predicted on the test set:
|   | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 |
We got 98.61% correct.
We correctly predicted 35 out of 121 deaths.
We correctly predicted 9826 out of 9879 lives.
XGBoost
xg_new=create_Full_xgboost_model(final_modelling_dummy_df,final_test_modelling_dummy_df,target_column,plot_comp=True, threshhold_class=0.5)
Model Training Score: 96.91034771679932%
The Original Vs Predicted Result Is:
----REPORT----
MAE: 0.0137
MSE: 0.0137
RMSE: 0.11704699910719625
R2: -0.14609920534333165
----DETAIL----
Accuracy:
0.9863
Confusion matrix:
[[9829 50]
[ 87 34]]
Classification report:
{'0': {'precision': 0.9912263009277935, 'recall': 0.9949387589837028, 'f1-score': 0.99307906036878, 'support': 9879}, '1': {'precision': 0.40476190476190477, 'recall': 0.2809917355371901, 'f1-score': 0.3317073170731707, 'support': 121}, 'accuracy': 0.9863, 'macro avg': {'precision': 0.6979941028448491, 'recall': 0.6379652472604465, 'f1-score': 0.6623931887209753, 'support': 10000}, 'weighted avg': {'precision': 0.9841300817341863, 'recall': 0.9863, 'f1-score': 0.9850764622749031, 'support': 10000}}
Importance by Booster Plot
Importance by Weight:
The Tree Is:
new_xg_class_rep=pd.DataFrame(xg_new['ClassificationRep'])
display(new_xg_class_rep)
print("The first 10 results predicted on the test set:")
display(xg_new['Actual vs Prediction'].head(10))
print("We got {:.2f}% correct".format(100*len(
xg_new['Actual vs Prediction'][xg_new['Actual vs Prediction']['Diff']==0])/len(
xg_new['Actual vs Prediction'])))
print("We correctly predicted {} out of {} deaths.".format(
len(xg_new['Actual vs Prediction'][(xg_new['Actual vs Prediction']['Diff']==0)&(xg_new['Actual vs Prediction']['Actual']==1)]),len(xg_new['Actual vs Prediction'][(xg_new['Actual vs Prediction']['Actual']==1)])))
print("We correctly predicted {} out of {} lives.".format(
len(xg_new['Actual vs Prediction'][(xg_new['Actual vs Prediction']['Diff']==0)&(xg_new['Actual vs Prediction']['Actual']==0)]),len(xg_new['Actual vs Prediction'][(xg_new['Actual vs Prediction']['Actual']==0)])))
|   | 0 | 1 | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.991226 | 0.404762 | 0.9863 | 0.697994 | 0.984130 |
| recall | 0.994939 | 0.280992 | 0.9863 | 0.637965 | 0.986300 |
| f1-score | 0.993079 | 0.331707 | 0.9863 | 0.662393 | 0.985076 |
| support | 9879.000000 | 121.000000 | 0.9863 | 10000.000000 | 10000.000000 |
The first 10 results predicted on the test set:
|   | Actual | Predicted | PredictionClass | Diff |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 |
We got 98.63% correct.
We correctly predicted 34 out of 121 deaths.
We correctly predicted 9829 out of 9879 lives.
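The headline accuracy is dominated by the majority class; the number that matters here is the sensitivity (recall) for the death class, which follows directly from the confusion matrix. For the XGBoost matrix above, sensitivity is TP / (TP + FN) = 34 / 121. A quick check (numbers taken from the confusion matrix reported above):

```python
import numpy as np

# XGBoost confusion matrix on the new data: rows = actual, cols = predicted
conf = np.array([[9829, 50],
                 [87, 34]])
tn, fp = conf[0]
fn, tp = conf[1]

sensitivity = tp / (tp + fn)       # recall for the death class
accuracy = (tp + tn) / conf.sum()
print(round(sensitivity, 4))  # ~0.281
print(round(accuracy, 4))     # 0.9863
```

A 98.6% accuracy alongside a 28% death recall is exactly the pattern expected when only 1.2% of cases are deaths.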
new_result_df=compare_models(model_report_list=[new_rf_class_rep.T,new_xg_class_rep.T]
, keys=["RandomForest","XGBoost"])
Comparison Dataframe is:
|   | Model | Type | precision | recall | f1-score | support |
|---|---|---|---|---|---|---|
| 0 | RandomForest | 0 | 0.991324 | 0.994635 | 0.992977 | 9879.0000 |
| 1 | RandomForest | 1 | 0.397727 | 0.289256 | 0.334928 | 121.0000 |
| 2 | RandomForest | accuracy | 0.986100 | 0.986100 | 0.986100 | 0.9861 |
| 3 | RandomForest | macro avg | 0.694525 | 0.641946 | 0.663952 | 10000.0000 |
| 4 | RandomForest | weighted avg | 0.984141 | 0.986100 | 0.985014 | 10000.0000 |
| 5 | XGBoost | 0 | 0.991226 | 0.994939 | 0.993079 | 9879.0000 |
| 6 | XGBoost | 1 | 0.404762 | 0.280992 | 0.331707 | 121.0000 |
| 7 | XGBoost | accuracy | 0.986300 | 0.986300 | 0.986300 | 0.9863 |
| 8 | XGBoost | macro avg | 0.697994 | 0.637965 | 0.662393 | 10000.0000 |
| 9 | XGBoost | weighted avg | 0.984130 | 0.986300 | 0.985076 | 10000.0000 |
Both the XGBoost and Random Forest models perform worse on the new historical dataset than on the original test set. Random Forest ended up slightly ahead on macro-average F1 score, although the two were very close.
Based on the drop in sensitivity when predicting deaths on the new dataset, it is likely that time is an important factor in COVID outcomes. This makes sense: the original data includes cases (and probable cases) from the beginning of the pandemic, when mortality was high due to poor understanding of the disease and uncertainty about how to treat it. As time has passed, understanding of which factors are significant has improved considerably.
Given this time effect, the models should be refreshed regularly and the feature set re-examined, most likely by adding a feature capturing the stage of the pandemic at which the diagnosis was made. This would likely improve accuracy on newer data.
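One simple way to capture pandemic stage would be to bucket the case date into coarse phases and one-hot encode the result. The public dataset has a `cdc_case_earliest_dt` field, though the column name and phase boundaries below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical case dates; in the real data these would come from cdc_case_earliest_dt
df = pd.DataFrame({'case_date': ['2020/03/15', '2020/09/01', '2021/02/10']})
df['case_date'] = pd.to_datetime(df['case_date'])

# Bucket dates into coarse pandemic phases (boundaries chosen for illustration)
bins = pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01', '2022-01-01'])
labels = ['early', 'mid', 'late']
df['pandemic_stage'] = pd.cut(df['case_date'], bins=bins, labels=labels)

# One-hot encode alongside the other descriptive features
stage_dummies = pd.get_dummies(df['pandemic_stage'], prefix='stage', drop_first=True)
```

The resulting `stage_*` dummies could then be appended to the existing dummy DataFrames before training.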
Ultimately, while neither model is optimal by any means, both are useful as indicators and better than guessing. If these models were productionised, it would be important to state very clearly that the result is only an indicator, since both still fail to classify many death cases.
Personally, given the ethical considerations, I would advise against deploying either model unless death classification can be significantly improved, whether by collecting vastly more data or by obtaining a richer dataset (particularly one including patient history or regional data).
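If the models were revisited, the class imbalance (121 deaths against 9,879 survivors in the test split) could also be addressed directly during training: scikit-learn's `class_weight='balanced'` and XGBoost's `scale_pos_weight` both up-weight the minority class, typically trading some precision for higher death recall. A minimal sketch with scikit-learn on synthetic data (no claim is made about the exact effect on the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the COVID features
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# class_weight='balanced' re-weights samples inversely to class frequency;
# the XGBoost analogue is scale_pos_weight = n_negative / n_positive
plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight='balanced', random_state=42).fit(X_tr, y_tr)

print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```

Comparing the two recall values on a held-out set, as above, is the natural way to decide whether the re-weighting is worthwhile for a given dataset.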